ABSTRACT


NoSQL databases are gaining popularity due to their ability to store and process large
heterogeneous data sets more efficiently than relational databases. Apache Accumulo is a
NoSQL database that introduced a unique information security feature: cell-level access
control. We study Accumulo to examine its cell-level access control policy enforcement
mechanism. We survey existing Accumulo applications, focusing on Koverse as a case
study to model the interaction between Accumulo and a client application. We conclude
with a discussion of potential security concerns for Accumulo applications. We argue that
Accumulo’s cell-level access control can assist developers in creating a stronger
information security policy, but Accumulo cannot provide security, particularly enforcement
of information flow policies, on its own. Furthermore, popular patterns for interaction
between Accumulo and its clients require diligence on the part of developers; without it,
unexpected behavior may undermine system policy. We highlight some
undesirable but reasonable confusions stemming from the semantic gap between cell-level
and table-level policies, and between policies for end-users and Accumulo clients.

CHAPTER 1:

Introduction 


For decades, relational databases have been the preferred method for readily retrievable
data storage. As data sets have become larger and less structured, inefficiencies have
emerged with relational databases [1]. The desire to solve these problems led to the
development of Not Only SQL (NoSQL) databases. Their popularity has grown rapidly during
the last ten years, and NoSQL databases are now used by several large companies, including
Google, Facebook, Twitter, LinkedIn, and Amazon, to manage large data sets.

Accumulo is a NoSQL database developed by the government primarily to store and process
large amounts of intelligence data [2]. The Accumulo project was an early developer
of cell-level access control for NoSQL databases. Recently, other NoSQL projects such as
HBase have followed suit. Cell-level access control is designed to allow secure access to
data sets of mixed sensitivity levels. This work describes the technical aspects
of Accumulo’s cell-level access control policy enforcement and comments more generally
on Accumulo’s role in maintaining data security in production applications.

1.1 Big Data in the Military 

The amount of data human beings generate and consume is increasing exponentially in both
the commercial sector and in the Department of Defense (DOD). In 2012, it was estimated
that seven million computing devices were being used in the military to process a 1,600
percent increase in data since September 2011 [3]. Currently, there are between two and
five terabytes of data stored for each member of the armed services [4]. Generation of large
amounts of data does not necessarily translate to good intelligence, as analysts can become
overwhelmed by the volume of data. An analyst attempting to glean intelligence from
modern data streams has been compared to a person trying to quench his thirst with a fire
hose [3]. Attempting to process so much data creates an “operational thrashing” problem
in which analysts spend more time organizing and preprocessing data than creating
actionable intelligence [5]. To understand the amount of data a typical analyst may have to sift
through, consider sitting down at a computer and looking through hundreds of thousands of
spreadsheets, each with hundreds of columns and tens of thousands of rows [6]. In a 2012
Forbes article, Lt. Gen. Michael Oates, head of the Joint Improvised Explosive Device
Defeat Organization, commented, “There is no shortage of data. There is a dearth of analysis” [3].

Generation of actionable intelligence from large data sets requires efficient analysis.
Manual analysis of large data sets to develop these insights is unsustainably resource intensive.
In January 2014, the deputy director of the Defense Intelligence Agency noted, “We’re 
looking for needles within haystacks while trying to define what the needle is, in an era 
of declining resources and increasing threats” [7]. Big data platforms have the storage 
and analytical capabilities necessary to handle large data sets. These solutions can relieve 
the processing burden on human analysts and allow them to spend more time generating 
real intelligence [5]. Big data analytics make information more usable, improve decision 
making, and lead to more focused missions and services. For instance, geographically 
separated teams can access a real-time common operating picture, diagnostic data mining 
can support proactive maintenance programs that prevent battlefield failures, and data can 
be transformed into a common structure that allows custom queries by a distributed force 
composed of many communities [4], [6]. 

Despite the constrained budgetary environment, the DOD continues to invest in big data. 
The DOD spends $250 million a year on big data initiatives, according to Military Times, 
and the FY2015 budget establishes big data investment among its science and technology 
priorities [7]. Several DOD agencies are funding big data programs. For example, the
Defense Advanced Research Projects Agency (DARPA) MUSE program seeks to improve the
software engineering process by mining a large corpus of software to find useful properties,
behaviors, and vulnerabilities and leverage that information to increase software
reliability [8]. The XDATA program, also backed by DARPA, is developing new computational
methods and tools for processing big data sets [9]. The Office of Naval Research (ONR)
Naval Tactical Cloud (NTC) project seeks to improve intelligence distribution across
disparate forces using cloud technologies [10].

As the DOD develops technologies to analyze and distribute information more efficiently, 
data security becomes more of a concern. Data flowing through mobile devices and across 
land, sea, and air battle spaces creates more opportunities for adversaries to intercept or 
manipulate data [4]. Applications must be developed with these security concerns in mind. 

1.2 Contributions

This thesis seeks to determine the role of Accumulo’s cell-level security in applications
requiring information security. Because Accumulo documentation does not provide a
detailed description of its operation, we use static analysis of Accumulo source code to
describe Accumulo’s architecture and detail its cell-level access control policy enforcement.
We discuss the interfaces between Accumulo and client applications. Finally, we describe
potential security concerns for Accumulo-based applications and argue that, while
Accumulo provides some assistance to developers in maintaining data security, a significant
portion of the overall security policy must be enforced at the client application level. We
believe our technical survey may assist future study in identifying and mitigating
potential information security vulnerabilities in Accumulo or Accumulo-based applications. Our
comments on potential concerns for configuration of Accumulo client and user interaction
motivate the need for a more thorough “best practice” guide.

1.3 Thesis Organization 

In Chapter 2, we provide background on NoSQL and Accumulo. Chapter 3 describes
Accumulo’s data model and software and hardware architecture. In Chapter 4, we discuss the use
of Authorizations and ColumnVisibilities to enforce cell-level access control in Accumulo.
This discussion includes a walk-through of those critical portions of Accumulo code used
for policy enforcement. Chapter 5 provides a general overview of Accumulo client
applications, and Chapter 6 provides a detailed discussion of Koverse as a case study. Chapter 7
is a discussion of potential security concerns for applications that integrate with Accumulo.
Finally, in Chapter 8 we present conclusions and topics for further study.

CHAPTER 2:
Background 


In this chapter, we provide an overview of the NoSQL ecosystem and NoSQL security 
concerns as well as a description of Accumulo’s role in the NTC. 

2.1 NoSQL Ecosystem 

NoSQL databases are gaining popularity as developers seek to address problems with
traditional relational databases. A 2012 Couchbase survey asked database system developers
what they considered to be the most critical problems with relational databases that
influenced their decision to use NoSQL solutions. Of the survey respondents, 49 percent
identified rigid schemas as a significant problem, 39 percent said lack of scalability, and 29
percent said high latency [11]. NoSQL databases offer several benefits [12] over relational
databases, including:

Reduced complexity. The rich feature set and strict ACID properties of relational databases 
may not be necessary for some data sets. 

Higher throughput. Cassandra writes 2,500 times faster into a 50GB database than 
MySQL [13]. BigTable can process 20 petabytes per day [14]. 

High degree of scalability on commodity hardware. NoSQL databases do not rely on 
highly available hardware and are designed to handle failure efficiently. Data can 
be partitioned across hardware more efficiently than relational database sharding. 
Hardware nodes can be added and removed relatively easily. 

More flexible data model. NoSQL databases are not restricted to the relational data
model, which can be inefficient for unstructured data sets.

While NoSQL databases address some problems with the relational model, they also
present their own set of problems. Most notable are the weaker guarantees offered by NoSQL
databases compared to ACID systems. Brewer’s CAP theorem says that database systems
must balance consistency, availability, and partition tolerance and that strong forms of all
three properties cannot be achieved simultaneously [15], [16]. NoSQL databases
generally sacrifice consistency for increased availability and partition tolerance. In contrast to
ACID properties provided by relational databases, many NoSQL systems claim to
provide BASE properties: basically available, soft-state, eventually consistent [17]. Another
weakness of NoSQL databases is the lack of a common interface like Structured Query
Language (SQL). SQL simplifies and standardizes database manipulation in relational
databases. NoSQL databases each have a unique programming interface that uses a
lower-level procedural language (e.g., Java) and requires more complex programming than SQL
to perform the same task [18].

Although NoSQL solutions are becoming a larger presence in the database community,
relational databases continue to be far more prevalent. Table 2.1 shows the ten most used
databases along with several other NoSQL databases for comparison, as reported by
DB-Engines. According to DB-Engines, the scores are standardized such that a database with
twice the score is twice as popular. MongoDB and Cassandra are the only two NoSQL
databases in the top ten and are much less popular than the top relational databases, but
NoSQL database use is increasing [19], as shown in Figure 2.1. The 2012 Couchbase study
claimed that 70 percent of large companies planned to fund NoSQL projects in 2012. Forty
percent of companies surveyed said that NoSQL technologies were important or critical to
daily operations, and an additional 37 percent said NoSQL was becoming important [11].

There are many types and implementations of NoSQL databases, but most share some
common features. The most obvious is that they do not conform to the relational data
model and are not heavily dependent on tables of data, or any other particular schema.
They also use a lower-level procedural query interface rather than SQL. Finally, NoSQL
databases scale well horizontally by distributing data across a “nothing shared” network
of commodity hardware [17], [18]. NoSQL databases are designed to perform in a variety
of use cases including large volume data storage, large scale data processing, embedded
(machine-to-machine) information retrieval, and exploratory analytics [12].

Rank  Database              Score
   1  Oracle                1470.86
   2  MySQL                 1281.22
   3  Microsoft SQL Server  1242.50
   4  PostgreSQL             249.85
   5  MongoDB                237.36
   6  DB2                    206.42
   7  Microsoft Access       139.62
   8  SQLite                  88.87
   9  Sybase ASE              86.17
  10  Cassandra               81.90
  11  Redis                   70.80
  15  HBase                   41.92
  18  Memcached               30.99
  21  CouchDB                 24.13
  30  Riak                    11.67
  54  Accumulo                 2.62


Table 2.1: Popularity of NoSQL databases, as reported by DB-Engines August 2014 rankings, 
after [19]. 


NoSQL databases are grouped in three categories. Key-value stores are the simplest of
the NoSQL implementations. They store data in maps, dictionaries, or hash tables [17]
and use basic put and get operations to write and read entries by key. The value is not
searchable. Key-value stores feature high scalability and efficient retrieval but lack complex
querying capability [12]. Examples of key-value stores are Dynamo, Voldemort, Redis,
Riak, and Memcached [12], [18]. Document stores add a level of complexity to simple
key-value stores. These NoSQL databases store documents [12], typically in a standard data
exchange format such as XML, JSON, or BSON [17]. Key-value pairs are encapsulated
in these schemaless documents. Both keys and values are searchable [17]. MongoDB and
CouchDB are the most common examples of document stores [18]. Column-oriented stores
are modeled after Google’s BigTable design. They store and process data by column, and
the keys have multiple attributes. They often integrate with a distributed file system such as
Google File System or Hadoop Distributed File System and a data analytic framework such
as MapReduce [17]. Examples of column-oriented stores are BigTable, HBase, Hypertable,
Cassandra, and Accumulo [18].
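
As a rough illustration of the three categories, the following Python sketch models each one with ordinary dictionaries and lists. The class and method names are hypothetical and chosen for this example only; they do not correspond to the API of any product named above.

```python
# Illustrative stand-ins for the three NoSQL categories (names are hypothetical).


class KeyValueStore:
    """Opaque values, written and read only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class DocumentStore:
    """Schemaless documents whose fields and values can both be searched."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def find(self, field, value):
        return [d for d in self._docs if d.get(field) == value]


class ColumnStore:
    """Each value is addressed by a multi-attribute key."""

    def __init__(self):
        self._cells = {}

    def put(self, row, family, qualifier, value):
        self._cells[(row, family, qualifier)] = value

    def get_row(self, row):
        # Return the (family, qualifier) -> value mapping for one row.
        return {k[1:]: v for k, v in sorted(self._cells.items()) if k[0] == row}
```

The key-value store offers no way to search by value, while the document store can match on any field; the column store sits between the two by exposing structure in the key itself.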



Figure 2.1: Trends in NoSQL database popularity, from [19]. 

Accumulo’s design is based on Google’s BigTable [20]. The data model and technology
dependencies are two major aspects of BigTable that carry into Accumulo. BigTable
introduced a multi-attribute key that identifies a row, column family, column qualifier, and
timestamp with each data entry. Entries are stored in Tables, which are distributed across
commodity hardware by dividing them into subsets called Tablets. A Tablet Server process
that manages a set of Tablets runs on each BigTable node. BigTable uses a distributed file
system, Google File System, for persistent storage, integrates with the MapReduce analytic
framework, and uses a distributed service to manage concurrency and consistency of
distributed nodes. All of these properties are also present in Accumulo and are discussed in
more detail in later chapters.

2.2 NoSQL Security 

As the scale of information sharing grows, so does the problem of maintaining the security
of that information. The growing number of information users, combined with more direct
access to data, requires closer attention to security policies and their enforcement. The
wide use of web interfaces to applications with database backends illustrates this problem.
Users are given more access through these interfaces, which are frequent victims of cyber
attacks. Databases have an important role in maintaining the confidentiality, integrity, and
availability of data. A compromised database can lead to improper access to data, improper
modification of data, or loss of access to data. These problems affect not only the individual
that owns the compromised data, but entire organizations and communities [21].

Okman et al. investigated NoSQL security in more detail using Cassandra and MongoDB
[22]. They identified the following potential security weaknesses:

• No encryption mechanism for data

• Unencrypted communication with clients

• Usernames and passwords sent as clear text

• Option available to encrypt inter-node communication but not the default setting

• No protection during bulk data ingest

• Query languages potentially susceptible to injection attacks

• Denial of service by thread consumption

• Weak native authentication and authorization implementations

• No redundancy in password and permission files

• Permission files not verified during each request

The Cloud Security Alliance defines the most critical information security threats they
perceive for big data, grouping these into the following categories [23]:

1. Secure computations in distributed programming frameworks

2. Security best practices for non-relational databases

3. Privacy preserving data mining and analytics

4. Cryptographically enforced data centric security

5. Granular access control

6. Secure data storage and transaction logs

7. Granular audits

8. Data provenance

9. End point validation and filtering

10. Real-time security monitoring

Of these concerns, the most relevant to our discussion of Accumulo are: 

Security best practices for non-relational data stores 

NoSQL databases have been designed with performance in mind and with few built-in
security features. Much of the NoSQL community relies on middleware to enforce
security policy. Each NoSQL solution has a unique interface, so developers face the 
challenge of verifying the correctness of middleware security protocols and ensuring 
proper integration with a specific NoSQL database. 

Granular access control 

Big data, in an operational or intelligence context, originates from a variety of sources
and sensitivity levels. Coarse access control policies may unnecessarily restrict
information that could be used to generate insightful analytics. Finer access control,
such as Accumulo’s cell-level control, can maximize data sharing while maintaining
secrecy.

2.3 Naval Tactical Cloud 

The Unified Cloud Data (UCD) ecosystem was developed by United States Army
Intelligence and Security Command (INSCOM) to improve data sharing and analytic capabilities.
The NTC project seeks to adapt the UCD model for use by the Navy [10]. NTC addresses
military information dissemination challenges including distribution of data over a tactical
force, prioritizing data movement in constrained network conditions, representation of data
for efficient movement across tactical networks, prioritizing data retention and indexes in
constrained storage conditions, and designing analytics that work across a distributed force.
NTC plans to meet these challenges by combining semantic web and big data technologies
to merge data sets from different communities, leading to more insightful, actionable
intelligence.

Accumulo is an integral part of the NTC architecture. NTC data is represented in a graph 
structure that defines relationships between data items. This structure makes it easy to add 
new data or merge disjoint data sets. This data model requires the addition of metadata to 
identify nodes of the graph, their properties, and the relationships between them. NTC uses 
Accumulo to provide distributed storage of raw data items and all metadata necessary to
integrate data into the graph. 

Graph edges are 3-tuples that identify a subject, an object, and a relationship between
them. Subjects and objects can be any entity within the context of the data that the graph
describes. These are referred to collectively as Terms and are stored together in an
Accumulo Term Table. Relationships are stored in a separate Predicate Table. The Statement
Table stores the graph edges as subject-object-predicate tuples. An Artifact Table
preserves the raw input data items prior to graph processing [10, pp. 80-88].
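
The division of labor among these four tables can be pictured with a small sketch. The Python below is only an illustration of the description above: the class, its methods, and the sample data are hypothetical, and in-memory sets stand in for Accumulo tables. It does not reflect the actual NTC schema, which is not detailed in the sources we reviewed.

```python
class GraphStoreSketch:
    """Hypothetical in-memory stand-in for the four NTC tables."""

    def __init__(self):
        self.term_table = set()       # subjects and objects
        self.predicate_table = set()  # relationship names
        self.statement_table = set()  # (subject, object, predicate) edges
        self.artifact_table = []      # raw input items, kept as ingested

    def ingest(self, raw_item, subject, obj, predicate):
        # Preserve the raw item, then record the edge it contributes.
        self.artifact_table.append(raw_item)
        self.term_table.update((subject, obj))
        self.predicate_table.add(predicate)
        self.statement_table.add((subject, obj, predicate))

    def edges_from(self, subject):
        return sorted(e for e in self.statement_table if e[0] == subject)
```

Because new edges only add entries to each table, merging a disjoint data set amounts to ingesting its statements into the same store.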

Accumulo was chosen for this task, at least in part, because of its cell-level access control
capability. Fine-grained access control could enhance data availability and thereby enhance
analytical processing and information dissemination while maintaining information
security. Unfortunately, the NTC project is still under development and a detailed description
of how Accumulo’s cell-level access control would be used in NTC is not available.

CHAPTER 3:
Accumulo Overview 


Accumulo is a distributed data storage application developed by the National Security 
Agency (NSA), following and extending Google’s BigTable design [20]. Under pressure 
from the United States Senate Armed Services Committee, the NSA submitted Accumulo 
as an open source project that is now run by Apache [24]. Accumulo is a NoSQL database, 
a term used to describe a large family of data storage solutions that do not adhere to a 
traditional relational database model. Like other NoSQL databases, Accumulo provides a 
simple and flexible data model with restricted query semantics. This simplicity is credited
with enabling scalability: Accumulo handles large data sets while maintaining efficient data
retrieval performance. Benchmarking studies have shown Accumulo to be capable of processing
hundreds of terabytes of data at rates of over 100 million data entries per second [25]-[27].
In contrast to similar column-oriented NoSQL data stores, Accumulo adds cell-level
access control to its data retrieval model. This chapter describes Accumulo’s data model and
system architecture.

3.1 Data Model 

While Accumulo is a column-oriented store, like Google BigTable and Apache HBase, it 
can be viewed as a simple key-value store. The key is composed of five different elements: 
row, column family, column qualifier, column visibility, and a timestamp (see Figure 3.1). 


Key: Row | Column Family | Column Qualifier | Column Visibility | Timestamp
Value

Figure 3.1: Accumulo key-value relationship 

The row, column family, and column qualifier elements are used to uniquely identify a set
of timestamped values in Accumulo. All information that will be used to locate a specific
value must be encoded in these three elements of the key. The column visibility element is
used to enforce Accumulo’s cell-level access control. Clients present a set of authorizations
to Accumulo, which it uses to filter data it returns to the client based on the policy in
each column visibility element. We discuss the interaction between authorizations and
column visibilities in more detail in Chapter 4. The timestamp element is used to implement
cell-level versioning. Any entries with identical row, column family, and column qualifier
elements are assumed to be different versions of the same value field. By default, Accumulo
returns only the most recent version of an entry. The value element is the raw data stored in
Accumulo. All elements of the key-value pair are stored as byte arrays with the exception
of the timestamp, which is stored as an integer. This generic typing of key elements allows
the Accumulo client flexibility in determining what data types will be used as each part of
the key-value entry.
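
To make the key structure and the default versioning behavior concrete, the following Python sketch models a key as a named tuple and keeps only the newest entry per cell. It follows the description above rather than Accumulo’s Java implementation; the `Key` class and `latest_versions` function are names introduced here for illustration.

```python
from typing import NamedTuple


class Key(NamedTuple):
    # Byte strings for every element except the integer timestamp.
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int


def latest_versions(entries):
    """Keep only the newest value for each (row, family, qualifier)."""
    newest = {}
    for key, value in entries:
        cell = (key.row, key.family, key.qualifier)
        if cell not in newest or key.timestamp > newest[cell][0].timestamp:
            newest[cell] = (key, value)
    return sorted(newest.values())


entries = [
    (Key(b"alice", b"contact", b"phone", b"", 1), b"555-0000"),
    (Key(b"alice", b"contact", b"phone", b"", 2), b"555-1111"),
    (Key(b"bob", b"contact", b"phone", b"", 1), b"555-2222"),
]
current = latest_versions(entries)
```

Here the two `alice` entries differ only in timestamp, so they are treated as versions of one value and only the later one survives in `current`.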

Accumulo automatically sorts data lexicographically by key upon ingest, so data with
similar keys are stored together. This strategy allows efficient range queries to take advantage
of data locality: related data, which is more likely to be accessed near the same time, is
stored near each other, decreasing overall access time.
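
A brief sketch shows why sorted storage makes range queries cheap. The Python below is an illustration only, with a sorted list standing in for stored entries; the `range_scan` function and the row names are assumptions made for the example.

```python
import bisect


def range_scan(sorted_entries, start_row, end_row):
    """Return entries whose row lies in [start_row, end_row)."""
    rows = [row for row, _ in sorted_entries]
    lo = bisect.bisect_left(rows, start_row)
    hi = bisect.bisect_left(rows, end_row)
    return sorted_entries[lo:hi]


# Byte strings compare lexicographically, so a shared prefix keeps rows adjacent.
entries = sorted([
    (b"user_carol", b"c"),
    (b"order_0002", b"o2"),
    (b"user_alice", b"a"),
    (b"order_0001", b"o1"),
    (b"user_bob", b"b"),
])

# "`" is the byte immediately after "_", so this range covers the "user_" prefix.
users = range_scan(entries, b"user_", b"user`")
```

All rows sharing the `user_` prefix form one contiguous slice, so the scan needs two binary searches and a single sequential read rather than a pass over every entry.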

Accumulo groups sorted key-value pairs into tables. Tables are used to organize and
distribute Accumulo entries across data storage nodes. Tables can be split along row
boundaries into smaller subsets called tablets. Tablets are the basic data structures that are
maintained by individual nodes in Accumulo’s distributed architecture.
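
Splitting along row boundaries means that a row can be mapped to its tablet with a single binary search over the split points. The following Python sketch assumes, for illustration, that each split point is the inclusive last row of its tablet; the function name and the sample split points are hypothetical.

```python
import bisect


def tablet_for_row(split_points, row):
    """Index of the tablet holding `row`, given sorted split rows.

    Assumed convention: tablet i holds rows greater than split_points[i-1]
    and up to and including split_points[i]; the last tablet holds every
    row after the final split point.
    """
    return bisect.bisect_left(split_points, row)


splits = [b"g", b"p"]  # two split points give three tablets
```

With these split points, rows up to `g` fall in tablet 0, rows after `g` up to `p` in tablet 1, and all remaining rows in tablet 2.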

The combination of table, row, column family, and column qualifier can be used to apply
a logical hierarchy to Accumulo data [28]. The key hierarchy is flexible and can be used
to organize data in many ways, ranging from a traditional relational table framework to
completely unstructured data. Figure 3.2 shows how the Accumulo key hierarchy might
be used by an organization to store employee information. Each employee is represented
as a row in the Employees Table. There is no requirement for each row in a table to
have the same number or types of columns, so each employee could have different types of
information stored. In this example, Bob does not have an office, so no location information
is stored. In a traditional relational database, the Employees Table would have empty cells
in Bob’s location entries, resulting in inefficient use of space. The key-value data model
allows flexible data organization as well as efficient data distribution across the individual
nodes in Accumulo’s architecture.


Table: Employees

Column qualifier:  Age  Sex  Phone     Email          Building  Office
Value:             28   F    555-1111  alice@org.net  Bill      256

Column qualifier:  Age  Sex  Phone     Email
Value:             32   M    555-2222  bob@org.net

Figure 3.2: Accumulo key hierarchy 


3.2 System Architecture 

Accumulo relies on Hadoop Distributed File System (HDFS) and Zookeeper to provide
data storage across distributed commodity hardware. HDFS provides Accumulo with
distributed data persistence. Zookeeper manages coordination of concurrent distributed
processes. Individual components of an Accumulo instance can run on separate machines in
different geographic locations.

3.2.1 Accumulo Components 

The main components of an Accumulo instance are a master server, a monitor, one or more 
tablet servers, a garbage collector, and one or more clients. 

Master. The master is responsible for managing tablet servers. It ensures that each tablet
is assigned to exactly one tablet server and that load is balanced across tablet servers.
It manages recovery in the event of a tablet server failure to ensure reliable
persistence of tablets. It also handles table management requests (creation, modification,
deletion) from clients.

Monitor. The monitor provides a web interface to monitor Accumulo performance. It is 
controlled by the master. 

Tablet server. The tablet server is the main data management component of Accumulo.
Each tablet server handles a subset of all tablets in the Accumulo instance. The main
function of a tablet server is to handle read and write requests from clients. In
response to a write request, the tablet server saves new data in memory in the memtable
data structure, sorts key-value pairs in memory, and periodically writes sorted
key-value pairs to HDFS for permanent storage. The tablet server also makes entries about
write events in a write-ahead log, to provide an efficient mechanism for tablet server
failure recovery. In response to a read request, the tablet server provides to the client
a sorted set of the requested key-value pairs, by merging data stored in HDFS and
memory.

Garbage collector. The garbage collector ensures efficient use of HDFS storage space by 
identifying and deleting files that are no longer used by any process. 

Client. Accumulo provides a client Application Programming Interface (API) that
contains interfaces for connecting to an Accumulo instance and executing read and write
requests.
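
The tablet server’s write and read paths described above can be summarized in a short sketch. The Python below is a simplified model under stated assumptions, not Accumulo’s implementation: the class name and the `flush_threshold` parameter are hypothetical, in-memory lists stand in for the write-ahead log and for files in HDFS, and a flush is triggered by entry count rather than by Accumulo’s actual policy.

```python
import heapq


class TabletServerSketch:
    def __init__(self, flush_threshold=2):
        self.write_ahead_log = []  # append-only record of writes, for recovery
        self.memtable = {}         # recent writes held in memory
        self.flushed_files = []    # sorted runs standing in for files in HDFS
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # Log the write first so it can be replayed after a failure.
        self.write_ahead_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable contents as one sorted run, then clear it.
        if self.memtable:
            self.flushed_files.append(sorted(self.memtable.items()))
            self.memtable = {}

    def read(self):
        """Merge the sorted runs and the memtable into one sorted result."""
        runs = self.flushed_files + [sorted(self.memtable.items())]
        return list(heapq.merge(*runs))
```

Because every run is already sorted, a read is a merge rather than a sort, which is what lets the tablet server return sorted key-value pairs drawn from both memory and persistent storage.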

3.2.2 HDFS Components 

The main components of HDFS are a name node, a secondary name node, a job tracker,
one or more data nodes, and one or more task trackers.

Name node. The name node is the master process in HDFS. It controls the HDFS
namespace and client access to HDFS files. It keeps track of where in HDFS each individual
file is stored.

Secondary name node. The secondary name node tracks HDFS state information that is
used by the name node at startup. It is not a backup for the name node.

Data node. The data nodes store files in HDFS. 

Job tracker. The job tracker manages MapReduce jobs. It divides each job into tasks and 
assigns them to task trackers. 

Task tracker. Task trackers perform work necessary to execute MapReduce jobs. They 
perform tasks assigned by the job tracker. 

Each of the Accumulo and HDFS components is implemented by a separate process.
Production implementations of Accumulo may co-locate these processes, if appropriate, 
depending on hardware, performance and availability. Although it is possible to run all 
Accumulo processes on one machine, an effective implementation will distribute workload 
across multiple machines [29]. 

3.2.3 Hardware Architecture 

Accumulo uses a distributed network of hardware to provide scalable data storage.
Figure 3.3 illustrates the interaction between Accumulo components in a possible distributed
architecture. Each gray box indicates a separate physical machine, blue circles are
processes, and green rectangles highlight notable data structures. Arrows indicate
communication between physical components.

Figure 3.3: Accumulo architecture


Client machines communicate with Zookeeper and tablet servers to make read and write 
requests. Clients may communicate with the master to perform administrative tasks and 
table operations (e.g., table creation). Zookeeper maintains consistent configuration and 
status information for all tablet servers. The master communicates with the individual 
tablet servers to distribute tablet load and respond to tablet server failure, and communicates 
with Zookeeper to promulgate tablet server status. The name node communicates with the
tablet server to provide the location of data in HDFS. It manages individual data nodes to
ensure proper data distribution throughout HDFS. The job tracker communicates with the
individual task trackers to execute MapReduce jobs. The secondary name node maintains
state information for the name node, to be used if the name node is restarted.

CHAPTER 4:

Accumulo Cell-Level Policy Enforcement 


Accumulo was the first to implement cell-level access control in the domain of NoSQL 
databases [30]. Databases generally grant user access permission at the table level [31], 
and in some cases, additional algorithms or data structures can be used to implement row 
or column level access control [32]. Accumulo’s cell-level security is native functionality 
that gives system administrators tighter control of user access to data. With coarser data 
access control, an administrator may have to make a choice between data security and 
availability. If an entire table, column, or row is restricted, there may be information within 
that dataset that should be accessible but is restricted to keep the other data in the dataset 
secure. Accumulo cell-level access control provides flexibility that prohibits access to data 
in accordance with policy, while maximizing access to other data [33]. 

Accumulo’s fine-grained access control is implemented by a column visibility label that is 
attached to each key-value pair. Clients that query the Accumulo database must provide a 
set of authorizations that are compared against column visibilities to determine if the client 
has access to each key-value pair. Accumulo only returns those entries that are accessible 
by the client. In this chapter we examine the process that Accumulo uses to enforce
cell-level data access control.

4.1 Column Visibility 

An Accumulo column visibility is a security label that is applied to each key-value pair. 
Although the column visibility is described in Accumulo documentation as an element of 
the key, it is not used to identify or locate data. Rather, it is an additional piece of metadata 
that is used to filter key-value pairs that are returned to the client. The visibility label is 
implemented as a Java ColumnVisibility object that becomes part of the key in each entry 
upon insertion. Within each ColumnVisibility object is a boolean expression that describes 
the authorizations needed to access the respective entry. A ColumnVisibility object stores 
the visibility expression in two ways. The first is a character string representing the raw 
boolean expression. The second is the root node of a binary tree describing the visibility
expression. The ColumnVisibility object parses the visibility expression and generates
a tree during initial construction of the object. Client code that queries the Accumulo
database must present authorizations that satisfy the boolean expression in order to retrieve
a particular entry.

4.1.1 ColumnVisibility Expression Syntax 

The visibility expression is a boolean expression that describes a set of authorizations that
must be provided to gain access to the data. The expression relates a set of tokens through
logical conjunction and disjunction. Syntactically, tokens are represented by character
strings, conjunction by the “&” character, and disjunction by the “|” character. Conjunctive
phrases must be grouped separately from disjunctive phrases using parentheses to explicitly
indicate precedence of operations. Beyond this minimum requirement, additional parentheses
may be used as desired to group individual tokens or groups of tokens. Token strings
in the visibility expression do not need to be quoted unless non-standard characters are
required. Standard characters include alphanumerics, underscore, hyphen, colon, period, and
forward slash. If the token is quoted, any characters can be used, but each backslash and
double quote inside the quoted string must be prefaced by a backslash.
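
The quoting rule described above can be sketched in a few lines of Java. This is a minimal, self-contained illustration and not Accumulo's own code; the class and method names (TokenQuoting, isStandard, quote) are assumptions made for this example.

```java
final class TokenQuoting {
    // Characters that may appear in an unquoted token, per the rule above.
    static boolean isStandard(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '_' || c == '-' || c == ':' || c == '.' || c == '/';
    }

    // Returns the token unchanged if every character is standard; otherwise wraps
    // it in double quotes and prefixes each backslash or double quote with a backslash.
    static String quote(String token) {
        boolean needsQuotes = false;
        for (int i = 0; i < token.length(); i++) {
            if (!isStandard(token.charAt(i))) {
                needsQuotes = true;
                break;
            }
        }
        if (!needsQuotes) {
            return token;
        }
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\\' || c == '"') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(quote("red_team-1")); // standard characters only
        System.out.println(quote("top secret")); // the space forces quoting
        System.out.println(quote("a\"b\\c"));    // quote and backslash are escaped
    }
}
```

A token built this way can be placed directly into a visibility expression alongside the “&” and “|” operators.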

Figure 4.1 is a context-free grammar representation of the ColumnVisibility expression syntax.
Non-terminal symbols are enclosed in angle brackets. Terminal symbols are enclosed
in single quotation marks. Square brackets indicate a set of ASCII character terminal symbols.
Within square brackets, the caret represents a logical negation indicating that the subsequent
characters are not part of the set. The dash indicates a range of ASCII characters. Table 4.1
provides examples of valid and invalid visibility strings.


Valid                      Invalid

(one&two)|(three&four)     one&two|three
((A)&(B))|C|(D)            A!&B#
./a/:2-_-&:b:__-4/:        "1234&"&5678"
                           "abc\123"


Table 4.1: Examples of valid and invalid ColumnVisibilities
<VISIBILITY>    ::= '(' <VISIBILITY> ')'
                  | <TERM>
                  | <ANDS> '&' <ANDS>
                  | <ORS> '|' <ORS>

<ANDS>          ::= '(' <ANDS> ')'
                  | <TERM>
                  | '(' <ORS> ')'
                  | <ANDS> '&' <ANDS>

<ORS>           ::= '(' <ORS> ')'
                  | <TERM>
                  | '(' <ANDS> ')'
                  | <ORS> '|' <ORS>

<TERM>          ::= '(' <TERM> ')'
                  | '"' <QUOTECHARS> '"'
                  | <NOQUOTECHARS>

<QUOTECHARS>    ::= <QCHAR>
                  | <QCHAR> <QUOTECHARS>

<QCHAR>         ::= [^"\]
                  | '\\'
                  | '\"'

<NOQUOTECHARS>  ::= <NQCHAR>
                  | <NQCHAR> <NOQUOTECHARS>

<NQCHAR>        ::= [a-zA-Z0-9_-:./]


Figure 4.1: ColumnVisibility expression syntax as a context free grammar


4.1.2 Parsing a ColumnVisibility 

A ColumnVisibility object contains a parsing algorithm that is used to generate a binary 
tree from the boolean visibility expression. The tree is used to facilitate authorization 
checking during a query. The parser scans the expression left to right looking for “&” and
“|” characters, which become the root nodes of subtrees within the parse tree. The leaf
nodes of the parse tree are the terms of the visibility expression.

Each node is a Node object containing three pieces of information: range, type, and children.
The Node range is defined by a start integer and an end integer, which are indexes into
the ColumnVisibility expression character array. These two integers indicate the portion of
the ColumnVisibility expression, beginning with start and up to but not including end, that
is encompassed by the subtree beginning with that Node. The type is an integer indicating
whether the Node is the root of an AND or an OR subtree, or a TERM leaf Node. The
child nodes are stored as a list of Node objects. Figure 4.2 is the parse tree for an example
ColumnVisibility expression. Arrows in the tree indicate child Nodes.


expression:    (  r  e  d  &  g  r  e  e  n  )  |  (  b  l  u  e  &  o  r  a  n  g  e  )

array index:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24



Figure 4.2: Example ColumnVisibility parse tree 
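
The parse described above can be illustrated with a small recursive-descent parser. This is a simplified sketch under stated assumptions, not the Accumulo parser: it handles unquoted tokens only, all names (VisibilityParseSketch, Node, Type, describe) are hypothetical, and the range it records for an AND or OR Node (from the start of its first operand to the end of its last) is a convention chosen for this example that may differ from the ranges Accumulo stores.

```java
import java.util.ArrayList;
import java.util.List;

final class VisibilityParseSketch {
    enum Type { TERM, AND, OR }

    static final class Node {
        final Type type;
        final int start; // index of the first character covered by this subtree
        final int end;   // index one past the last character covered
        final List<Node> children = new ArrayList<Node>();

        Node(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    private final String expr;
    private int pos = 0;

    private VisibilityParseSketch(String expr) {
        this.expr = expr;
    }

    static Node parse(String expr) {
        VisibilityParseSketch p = new VisibilityParseSketch(expr);
        Node root = p.parseExpression();
        if (p.pos != expr.length()) {
            throw new IllegalArgumentException("unexpected character at " + p.pos);
        }
        return root;
    }

    // expression := operand, followed by either '&' operands or '|' operands, never both
    private Node parseExpression() {
        int start = pos;
        Node first = parseOperand();
        if (!atOperator()) {
            return first;
        }
        char op = expr.charAt(pos);
        List<Node> operands = new ArrayList<Node>();
        operands.add(first);
        while (atOperator()) {
            if (expr.charAt(pos) != op) {
                throw new IllegalArgumentException("mixing & and | requires parentheses at " + pos);
            }
            pos++; // consume the operator
            operands.add(parseOperand());
        }
        Node node = new Node(op == '&' ? Type.AND : Type.OR, start, pos);
        node.children.addAll(operands);
        return node;
    }

    // operand := '(' expression ')' | token
    private Node parseOperand() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            Node inner = parseExpression();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing ) at " + pos);
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && isTokenChar(expr.charAt(pos))) {
            pos++;
        }
        if (pos == start) {
            throw new IllegalArgumentException("expected token at " + pos);
        }
        return new Node(Type.TERM, start, pos);
    }

    private boolean atOperator() {
        return pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|');
    }

    private static boolean isTokenChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || "_-:./".indexOf(c) >= 0;
    }

    // Renders one line per Node: type, range, and the covered substring.
    static String describe(String expr, Node n, String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append(n.type).append(" [").append(n.start).append(',').append(n.end)
          .append(") ").append(expr.substring(n.start, n.end)).append('\n');
        for (Node child : n.children) {
            sb.append(describe(expr, child, indent + "  "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String expr = "(red&green)|(blue&orange)";
        System.out.print(describe(expr, parse(expr), ""));
    }
}
```

For the example expression, the sketch produces an OR root with two AND children, each holding two TERM leaves whose ranges index into the character array shown in the figure.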


4.2 Authorizations 

Authorizations are security tokens provided by the client when querying the Accumulo
database. Each token is a string that is intended to identify some level of data access authority.
When a user is created in Accumulo, it is assigned a set of authorizations, stored in
an Authorizations object. Any client that connects to Accumulo as that user must submit
a subset of the authorizations stored in the user account. The client may choose which
of the authorizations are necessary for each query. Any query that is submitted with authorizations
outside of the set stored in the user account will fail. If the client provides
an appropriate subset of authorizations, the provided authorizations are compared to the
ColumnVisibility expression associated with each key-value pair in the requested range of
entries. If the authorizations satisfy the ColumnVisibility expression, that key-value pair is
returned to the client.
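
The subset requirement above amounts to a set-containment check. The following sketch models authorization tokens as plain strings; the class and method names are hypothetical and the code does not use the Accumulo API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AuthorizationSubsetSketch {
    // True only if every token the client submits is held by the user account.
    static boolean isPermitted(Set<String> userAccountAuths, Set<String> requestedAuths) {
        return userAccountAuths.containsAll(requestedAuths);
    }

    public static void main(String[] args) {
        Set<String> account = new HashSet<String>(Arrays.asList("red", "green", "blue"));
        // A proper subset of the account's tokens is accepted.
        System.out.println(isPermitted(account, new HashSet<String>(Arrays.asList("red", "blue"))));
        // A token outside the account's set causes the query to be rejected.
        System.out.println(isPermitted(account, new HashSet<String>(Arrays.asList("red", "orange"))));
    }
}
```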

For the remainder of this chapter, we provide a detailed guide through the Accumulo code 
that processes authorizations during data queries. We use Accumulo version 1.5.0 as the 
reference source code [34]. The discussion is divided into three parts. First, the client 
code determines the appropriate authorizations and sends them with the data query to the 
tablet server. Next, the tablet server receives the query request from the client and retrieves
the appropriate key-value pairs. Finally, we discuss the policy enforcement point at which
the tablet server filters the results that are returned to the client. Filtering is based on a
comparison of the client authorizations and the column visibility associated with each
key-value pair.

Throughout this discussion, we reference three Java constructs: objects, fields, and methods.
Objects are italicized for clarity. Fields, or variables within objects, are further differentiated
using bold font. Methods, or object functions, are also bold but have a set of
parentheses at the end of the name. The first time we reference each construct, we present
the full package name to establish its location within the source code. Subsequent references
include only that portion of the name necessary to avoid ambiguity. We do not cover
all of the Accumulo code used to process queries, and the arguments noted for each step are 
not necessarily all the arguments required to properly execute that portion of code. These 
omissions allow us to focus on authorization processing within the query framework. 

4.2.1 Client Authorization Handling 

To query data, Accumulo client applications must first connect to an Accumulo instance 
using valid user credentials. Using a set of authorizations, the client creates a scanner that 
utilizes that connection to retrieve the appropriate data. The scanner provides an iterator 
framework that the client uses to step through the results of the query. The iterator sends 
the query request to the appropriate tablet server and supplies the results to the client. 
Figure 4.3 shows the flow of authorizations through the client code. 
Figure 4.3: Client-side Authorizations flow

Connect to Accumulo instance 

• The client connects to an Accumulo instance by instantiating accumulo.core.client.Connector
using the appropriate username and password

• The code implementing a Connector object is supplied in
accumulo.core.client.impl.ConnectorImpl


Create a scanner

• The client obtains authorizations directly from the user or from a third-party authentication
service

• The client instantiates accumulo.core.security.Authorizations using user authorization
strings

• The client calls Connector.createScanner() with Authorizations as an argument

• createScanner() instantiates accumulo.core.client.Scanner

• Scanner implementation code is supplied in accumulo.core.client.impl.ScannerImpl

Scanner retrieves results from tablet server

• When the client iterates through the Scanner results, the ScannerImpl.iterator()
method is called

• iterator() uses Authorizations to instantiate an accumulo.core.client.impl.ScannerIterator

• ScannerIterator constructor uses Authorizations to instantiate
accumulo.core.client.impl.ThriftScanner.ScanState

• ScannerIterator.run() calls ThriftScanner.scan() with ScanState as an argument

• scan() calls accumulo.core.tabletserver.thrift.TabletClientService.Client.startScan()
with Authorizations as an argument

• startScan() calls Client.send_startScan() with Authorizations as an argument

• send_startScan() instantiates TabletClientService.startScan_args and stores Authorizations
in startScan_args.authorizations

• send_startScan() calls Client.sendBase() with the string “startScan” and startScan_args
as arguments

• sendBase() implementation code is supplied in thrift.TServiceClient.sendBase()

• sendBase() sends “startScan” to the tablet server then calls startScan_args.write()

• write() calls startScan_args.startScan_argsStandardScheme.write() with startScan_args
as an argument

• write() sends each argument from startScan_args to the tablet server sequentially

• Client.startScan() calls Client.recv_startScan() to get results from the tablet server

4.2.2 Tablet Server Authorization Handling 

The tablet server receives the query request, including authorizations, from the client. The 
tablet server first checks the authorizations against those stored in the user account. If 
the authorizations are a subset of the user account authorizations, the tablet server creates 
an iterator to scan the appropriate tablet or tablets. The iterator filters results based on 
comparison of authorizations and column visibilities. Figure 4.4 illustrates server side 
authorization flow. 
Figure 4.4: Server-side Authorizations flow


Start tablet server 

• accumulo.start.Main starts an accumulo.server.tabletserver.TabletServer process

• TabletServer.main() calls TabletServer.run()

• TabletServer.run() calls TabletServer.startTabletClientService()

• startTabletClientService()

- instantiates a TabletServer.ThriftClientHandler object

- uses ThriftClientHandler to instantiate an
accumulo.core.tabletserver.thrift.TabletClientService.Iface object

- uses Iface to instantiate a TabletClientService.Processor object

- calls TabletServer.startServer() with Processor as an argument

Initialize scan

• The TabletServer receives the client query request with “startScan” method indicated

• Processor maps “startScan” string to call to ThriftClientHandler.startScan() method

• TabletServer executes startScan() with client Authorizations as an argument

• startScan()

- calls accumulo.server.security.SecurityOperation.getUserAuthorizations() with
client user credentials as an argument

- verifies that Authorizations are a subset of the authorizations listed in the client’s
Accumulo user account

- calls onlineTablets.get() to locate the appropriate accumulo.server.tabletserver.Tablet
to fulfill the client request

- instantiates a TabletServer.ScanSession and stores Authorizations in ScanSession.auths

- instantiates a Tablet.Scanner by calling Tablet.createScanner() with Authorizations
as an argument

- stores Scanner in ScanSession.scanner

- calls ThriftClientHandler.continueScan() with ScanSession as an argument

Iterate through requested data

• continueScan() calls
accumulo.server.tabletserver.TabletServerResourceManager.executeReadAhead() with
ScanSession.NextBatchTask as an argument

• executeReadAhead() calls NextBatchTask.run()

• run() calls ScanSession.Scanner.read()

• read()

- instantiates Tablet.ScanDataSource with Authorizations as an argument

- uses ScanDataSource to instantiate
accumulo.core.iterators.system.SourceSwitchingIterator

- calls Tablet.nextBatch() with SourceSwitchingIterator as an argument

• nextBatch() calls SourceSwitchingIterator.seek()

• SourceSwitchingIterator.seek() calls ScanDataSource.createIterator()

• createIterator() uses Authorizations to instantiate
accumulo.core.iterators.system.VisibilityFilter

• VisibilityFilter constructor uses Authorizations to instantiate
accumulo.core.security.VisibilityEvaluator

• SourceSwitchingIterator.seek() calls ScanDataSource.readNext()

• readNext() calls VisibilityFilter.seek()

• VisibilityFilter.seek() implementation code is supplied in accumulo.core.iterators.Filter.seek()

• Filter.seek() calls Filter.findTop()

Check visibility of each key-value pair

• Filter.findTop() calls VisibilityFilter.accept() with the key and value as arguments

• accept() calls VisibilityEvaluator.evaluate() with accumulo.core.security.ColumnVisibility
taken from the key as an argument

• evaluate() verifies that the Authorizations satisfy the ColumnVisibility expression
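
The filtering pattern at the end of this call chain, in which findTop() advances until accept() approves an entry, can be sketched with a generic iterator wrapper. This is an illustration of the pattern only; FilteringIteratorSketch and its members are hypothetical names, and Accumulo's Filter and VisibilityFilter classes have a richer interface than the one shown here.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

final class FilteringIteratorSketch<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> accept;
    private T top;          // next entry that passed the filter
    private boolean hasTop; // whether 'top' currently holds a valid entry

    FilteringIteratorSketch(Iterator<T> source, Predicate<T> accept) {
        this.source = source;
        this.accept = accept;
        findTop();
    }

    // Advance the source until an accepted entry is found or the source is exhausted.
    private void findTop() {
        hasTop = false;
        while (source.hasNext()) {
            T candidate = source.next();
            if (accept.test(candidate)) {
                top = candidate;
                hasTop = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasTop;
    }

    @Override
    public T next() {
        if (!hasTop) {
            throw new NoSuchElementException();
        }
        T result = top;
        findTop();
        return result;
    }

    public static void main(String[] args) {
        // Each entry is {visibility label, value}; only "red" entries are accepted.
        Iterator<String[]> entries = Arrays.asList(
                new String[] {"red", "v1"},
                new String[] {"blue", "v2"},
                new String[] {"red", "v3"}).iterator();
        Iterator<String[]> visible =
                new FilteringIteratorSketch<String[]>(entries, e -> e[0].equals("red"));
        while (visible.hasNext()) {
            System.out.println(visible.next()[1]);
        }
    }
}
```

Entries that fail the predicate are skipped inside the wrapper, so the consumer of the iterator never observes them.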

4.2.3 Checking Authorizations against Visibilities 

The policy enforcement point of Accumulo’s cell-level security is the comparison of client-supplied
Authorizations against the ColumnVisibility expressions in each key-value pair. At
this point, Accumulo decides whether to return data to the user. The Accumulo construct
that performs the comparison is the VisibilityEvaluator. The VisibilityEvaluator uses the
parse tree constructed for the ColumnVisibility expression and evaluates it against the Authorizations.
The VisibilityEvaluator starts at the root of the ColumnVisibility parse tree
and works toward the leaves. It checks the type of each Node in the tree to determine if it
is a leaf Node. If the Node is a leaf Node, the VisibilityEvaluator checks whether the authorization
token associated with that Node is present in the Authorizations provided by the
client. If the Node is not a leaf Node, the VisibilityEvaluator evaluates the Node’s children.
Accumulo will not return data to the client unless the client-supplied Authorizations satisfy
the entire boolean expression described by the ColumnVisibility parse tree.

The evaluation algorithm is performed by the VisibilityEvaluator.evaluate() method. The
Authorizations are stored as a field of the VisibilityEvaluator object. The algorithm begins
by examining the root Node. If the root Node is a TERM, evaluate() returns the result
of Authorizations.contains(term). If the root Node is an AND or an OR Node, evaluate()
is called recursively on the child Nodes. An AND Node will return TRUE if all of its
children return TRUE. An OR Node will return TRUE if any of its children return TRUE.
evaluate() returns TRUE if the Authorizations satisfy the full ColumnVisibility expression;
otherwise it returns FALSE. Pseudocode for the evaluate() method is shown in Figure 4.5.
Boolean evaluate(Node) {
    if Node.type == TERM:
        return Authorizations.contains(Node.term)

    if Node.type == AND:
        for child in Node.children:
            if !evaluate(child) return FALSE
        return TRUE

    if Node.type == OR:
        for child in Node.children:
            if evaluate(child) return TRUE
        return FALSE
}


Figure 4.5: Pseudocode for evaluate() algorithm 
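
The pseudocode can be turned into a small runnable Java sketch. The version below is an illustration only: it models Authorizations as a set of strings and builds the parse tree by hand, and the names EvaluateSketch, Node, and Type are assumptions made for this example, not the Accumulo classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EvaluateSketch {
    enum Type { TERM, AND, OR }

    static final class Node {
        final Type type;
        final String term;          // set only for TERM nodes
        final List<Node> children;  // empty for TERM nodes

        private Node(Type type, String term, List<Node> children) {
            this.type = type;
            this.term = term;
            this.children = children;
        }

        static Node term(String t) { return new Node(Type.TERM, t, new ArrayList<Node>()); }
        static Node and(Node... c) { return new Node(Type.AND, null, Arrays.asList(c)); }
        static Node or(Node... c)  { return new Node(Type.OR, null, Arrays.asList(c)); }
    }

    // Recursive evaluation mirroring the pseudocode: a TERM is a membership test,
    // an AND needs every child to hold, and an OR needs at least one child to hold.
    static boolean evaluate(Node node, Set<String> authorizations) {
        switch (node.type) {
            case TERM:
                return authorizations.contains(node.term);
            case AND:
                for (Node child : node.children) {
                    if (!evaluate(child, authorizations)) return false;
                }
                return true;
            case OR:
                for (Node child : node.children) {
                    if (evaluate(child, authorizations)) return true;
                }
                return false;
            default:
                throw new IllegalStateException("unknown node type");
        }
    }

    public static void main(String[] args) {
        // Hand-built tree for (red&green)|(blue&orange)
        Node tree = Node.or(
                Node.and(Node.term("red"), Node.term("green")),
                Node.and(Node.term("blue"), Node.term("orange")));
        System.out.println(evaluate(tree, new HashSet<String>(Arrays.asList("red", "green"))));
        System.out.println(evaluate(tree, new HashSet<String>(Arrays.asList("red", "blue"))));
    }
}
```

With the tokens red and green the first conjunction is satisfied and the entry would be returned; with red and blue neither conjunction is satisfied and the entry would be filtered out.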
CHAPTER 5:

Accumulo Client Applications 


Accumulo provides a client API that allows applications to programmatically interact with
Accumulo data. Client applications typically add more data management features and analysis
capabilities such as a graphical user interface to browse data, a more expressive query
language, data processing libraries for interpreting raw data, or graphical output for simpler
consumption by end-users. This chapter introduces interaction between Accumulo
and client applications and discusses some representative example applications using
Accumulo.


5.1 Key Accumulo Client Interfaces 

There are some Accumulo interfaces that are commonly used by client applications independent
of implementation [29]. These interfaces allow Accumulo clients to connect to
an Accumulo instance, write data to tables in Accumulo, and retrieve specific data entries
from Accumulo.

Connector. To connect to an Accumulo instance, the client creates a Connector object. 
The Connector is constructed based on the location of the Accumulo master, and the 
credentials for the user on behalf of whom the client application is operating. The 
Connector establishes the line of communication between the client and Accumulo.

BatchWriter. Once connected to Accumulo, the client writes data using a BatchWriter
object. The BatchWriter is constructed using the name of the destination table. The 
elements of the key-value pair are stored in a Mutation object which the BatchWriter 
sends to Accumulo. 

Scanner. To retrieve data, the client uses a Scanner object. A Scanner is constructed using 
the name of the table, the authorization tokens used to access the data, and the range 
of data requested. The Scanner provides an iterator that the client uses to step through 
the results of the scan. 

These interfaces form the foundation of client interaction with Accumulo and allow client
applications to perform basic write and read operations to store and retrieve data. Figure 5.1
is a sample of code demonstrating a client connecting to Accumulo, storing an entry, then
retrieving that entry. This example assumes that the user username and table tableName
have already been established in the Accumulo instance instanceName.


//Connect to Accumulo instance
Instance instance = new ZooKeeperInstance("instanceName", "zooServerName");
Connector conn = instance.getConnector("username", new PasswordToken("password"));

//Create entry
Mutation mutation = new Mutation("rowName");        //row id
mutation.put("columnFamilyName",                    //column family
    "columnQualifierName",                          //column qualifier
    new ColumnVisibility("visibilityName"),         //column visibility
    System.currentTimeMillis(),                     //timestamp
    "entryValue");                                  //value

//Store entry in Accumulo
BatchWriter writer = conn.createBatchWriter("tableName", new BatchWriterConfig());
writer.addMutation(mutation);
writer.close();

//Retrieve entry
Scanner scan = conn.createScanner("tableName", new Authorizations("visibilityName"));
for (Entry<Key,Value> entry : scan) {
    System.out.println(entry.getValue().toString());
}

Figure 5.1: Accumulo client code example 


5.2 Multi-User Client Applications 

When using Accumulo as part of a data management system, the client application will
likely have many users that need to access different subsets of data. Accumulo provides
cell-level security labeling to facilitate data segregation in a multi-user environment. Client
applications, however, are not intended to run under the identity of different users or to
authenticate to Accumulo under the identities of different users [29]. Instead, one Accumulo
user is created and the client application accesses Accumulo through that user’s credentials.
The client application access control policy must include a strategy for associating
the appropriate set of privileges with each user.

There are two sets of privileges client applications manage. Administrative permissions
include user management, system settings, and the ability to access or modify tables.
Cell-level authorizations allow users to access Accumulo entries.

Accumulo provides no assistance to the client in managing administrative permissions
associated with application users. The Accumulo user necessarily has all administrative
permissions required for the client application to function on behalf of any application user.
Thus, an application user has all of the same permissions as the Accumulo user, unless the
client takes steps to restrict user activity.

Controlling user access to data is a distinct problem from managing administrative permissions.
The ability to access individual entries in Accumulo is dependent on the Authorizations
provided by the user. The Accumulo user is assigned Authorizations covering the
entire set of Authorizations any application user might need. It is the responsibility of the
application to verify user credentials and prepare an appropriate set of Authorizations that
reflect the user’s permissions prior to querying Accumulo. Accumulo’s cell-level security
integrates security label processing into the database, but the client application must have a
reliable procedure in place for verifying the identity of its users and associating appropriate
Authorizations with each query submitted by a user.
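
One way an application might associate tokens with its own users is sketched below. The class UserAuthorizationSketch and its methods are hypothetical and appear in no Accumulo API; the sketch assumes user identity has already been verified and models tokens as strings. In a real client, the returned set would be used to build the Authorizations passed when creating a Scanner.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class UserAuthorizationSketch {
    // Tokens the application has associated with each of its own users.
    private final Map<String, Set<String>> tokensByUser = new HashMap<String, Set<String>>();

    void grant(String user, String... tokens) {
        tokensByUser.computeIfAbsent(user, u -> new HashSet<String>())
                    .addAll(Arrays.asList(tokens));
    }

    // Tokens to attach to a query issued on behalf of 'user'. Unknown users get
    // an empty set, so they can read only entries with an empty visibility.
    Set<String> authorizationsFor(String user) {
        return Collections.unmodifiableSet(
                tokensByUser.getOrDefault(user, Collections.<String>emptySet()));
    }

    public static void main(String[] args) {
        UserAuthorizationSketch app = new UserAuthorizationSketch();
        app.grant("alice", "red", "green");
        app.grant("bob", "blue");
        System.out.println(app.authorizationsFor("alice").contains("green"));
        System.out.println(app.authorizationsFor("mallory").isEmpty());
    }
}
```

The key design point is that the single Accumulo account holds every token, so the application-side lookup is the only place where one user's tokens are kept apart from another's.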

5.3 Accumulo Client Examples 

Accumulo clients can be simple applications that use only the native Accumulo client API 
(Figure 5.2(a)), or they can scale to much larger applications that provide a more abstract 
user interface and integrate with other applications (Figure 5.2(b)). We provide three examples
of Accumulo client applications that illustrate the range of potential use cases.
(a) Simple Accumulo Client

(b) Non-trivial Accumulo client


Figure 5.2: Examples of Accumulo client structure


5.3.1 Trendulo 

Trendulo [35] is a demonstration application developed for Accumulo. It is composed of an 
ingest application to store Twitter data in Accumulo and a web application that allows users 
to query the Twitter data. The ingest application is written in Java and interfaces with the 
Twitter Streaming API. The web application is written in HTML, JavaScript, and Java and 
incorporates several open source application development tools including Spring, JQuery, 
Bootstrap, ICanHaz, and Highcharts. It can be deployed using an open source web server 
such as Apache httpd or Nginx. Web application users issue simple queries to view Twitter 
trend data, showing frequency of target keywords over various time periods. Trendulo does 
not use column visibilities and does not differentiate between individual users. As a result, 
Trendulo provides little insight into the data security features of Accumulo. 

5.3.2 Sqrrl 

Sqrrl is a company founded by Accumulo developers that provides a large-scale enterprise
data management solution. Sqrrl Enterprise [36] uses Accumulo to facilitate real-time application
development. It provides its own methods for streaming data ingest and for batch
ingest of static data (e.g., JSON or CSV data). It also implements security controls that
incorporate Accumulo’s cell-level security. It can identify and authenticate users, provide
automatic data labeling based on organizational policy, and provide data encryption for an
additional level of security. Sqrrl Enterprise also enables complex data analysis through
additional data models, more expressive query languages, an indexing framework, and custom
iterators. Sqrrl Enterprise is an example of a production application that implements
user management policies; however, lack of access to this application prevents us from
analyzing it as a case study.

5.3.3 Koverse 

Koverse [37] is a data storage and analysis framework that is focused on operationalizing 
large amounts of data. Koverse automatically processes data upon ingest to store it in a 
consumable form. It uses role-based access control to manage multiple users but relies on 
third party applications to make use of Accumulo’s cell-level security. Koverse provides 
data analysis algorithms that can merge data sets and identify meaningful relationships 
within large data sets. Koverse also provides support for developers to extend Koverse 
capabilities into custom applications. Koverse is the data query interface for the Naval 
Tactical Cloud project [10]. In the next chapter, we examine Koverse as a case study in 
how Accumulo is integrated into a production application. 
CHAPTER 6:

Accumulo Client Case Study 


Koverse is a large-scale data management application that uses Accumulo for persistent 
data storage. In this chapter, we examine Koverse as a case study of Accumulo client 
application design. We do not provide a complete description of Koverse functionality. 
Instead, we focus on components of Koverse that illustrate interaction with Accumulo. 
We describe user management in a multi-user environment with a focus on data access 
authorization. Additionally, we illustrate the process of executing queries in Koverse to 
include the transformation of a user query into an Accumulo Scanner. These core functions 
form the foundation of Accumulo client design. 

6.1 Architecture 

The Koverse application has two main components—the Koverse server and the Koverse 
web application. The web application is the front end for Koverse and provides a graphical 
user interface. It is written in HTML, JavaScript, and Java and uses the JBoss development 
framework [38]. Built-in applications within the web interface allow users to:

• Manage data collections 

• Import data from external sources 

• Query data 

• Analyze data through transforms 

• Manage users and groups 

Users can perform all Koverse functionality through the web interface, but it is also possible 
to interact directly with the Koverse server using the Koverse API. 

The Koverse server processes requests from the Koverse web application. It is written 
in Java, and interacts with the Accumulo client API. The Koverse server also interacts 
with third-party applications to perform authentication and security token assignment for 
Koverse users. Figure 6.1 illustrates component interaction in a Koverse environment. 
Figure 6.1: Koverse application architecture


6.2 Data Model 

Koverse overlays its own data model on top of the Accumulo data model. Koverse stores
data in Records, which are sets of key-value pairs referred to as Fields. The Field values
can be one of several native data types (e.g., strings, integers, floating point numbers,
timestamps, geospatial data). Field names must be unique within a Record, but may be
reused across Records. Fields are not strongly typed, so two different Records using the
same Field name may have different types associated with those Fields. Records can provide
more complicated structure to data by nesting additional Fields within a Field value.
Koverse applies security labels, analogous to Accumulo ColumnVisibility objects, at the
Record level. When a Record is written to Accumulo, each Field is mapped into Accumulo’s
column-oriented model (see Table 6.1).

Each Koverse Record is assigned to one Collection. A Collection forms a set of related 
Records. Collections are schema-less, and each Record in the Collection can have a unique 
Field structure, including the number of Fields and the types of each Field. Mappings of 
Records to Collections and user Collection permissions are stored in a Java Persistence 
API (JPA) [39] system that is separate from Accumulo. No information indicating the
Collection corresponding to a Record is stored in Accumulo.

A Collection can be thought of as a table in a relational database where each Record is a row
and the Fields are columns. Collections, however, do not map to Accumulo tables. Koverse 
stores all data in two distinguished tables in Accumulo: the index table and record table. 
The index table stores a portion of each Record, organized in a way that improves query 
execution. The record table stores all Koverse Records regardless of which Collection the 
Record is associated with. Each entry in the record table is one Field of a Koverse Record. 


Accumulo entry       Koverse Record

Row ID               Record ID
Column Family        not used
Column Qualifier     Field name
Column Visibility    Koverse security label
Timestamp            applied by Accumulo
Value                Field value


Table 6.1: Mapping a Koverse Record to an Accumulo entry
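
The mapping in Table 6.1 can be sketched as a function that expands one Record into one entry per Field. This is an illustration of the table only, not Koverse code: the names RecordMappingSketch, CellEntry, and toEntries are hypothetical, Field values are simplified to strings, and the timestamp is omitted because it is applied by Accumulo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RecordMappingSketch {
    // One Accumulo-style entry, without the timestamp that Accumulo applies itself.
    static final class CellEntry {
        final String rowId;
        final String columnFamily;
        final String columnQualifier;
        final String columnVisibility;
        final String value;

        CellEntry(String rowId, String columnFamily, String columnQualifier,
                  String columnVisibility, String value) {
            this.rowId = rowId;
            this.columnFamily = columnFamily;
            this.columnQualifier = columnQualifier;
            this.columnVisibility = columnVisibility;
            this.value = value;
        }
    }

    // Each Field becomes one entry: Record ID as row ID, empty column family,
    // Field name as column qualifier, and the Record's label as column visibility.
    static List<CellEntry> toEntries(String recordId, String securityLabel,
                                     Map<String, String> fields) {
        List<CellEntry> entries = new ArrayList<CellEntry>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            entries.add(new CellEntry(recordId, "", field.getKey(), securityLabel, field.getValue()));
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("name", "mary");
        fields.put("occupation", "shepherd");
        for (CellEntry e : toEntries("rec-1", "red", fields)) {
            System.out.println(e.rowId + " " + e.columnQualifier + " [" + e.columnVisibility + "] " + e.value);
        }
    }
}
```

Because the label is applied at the Record level, every entry produced from the same Record carries the same column visibility.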


6.3 User and Group Management 

Koverse can manage many concurrent users. Users are identified by a username and email
address and authenticate using a password. Koverse administers role-based privileges
through groups. Each group is given a set of privileges, and a user assumes the privileges
of all the groups it is assigned to. User and group information is stored in the JPA
system. Although there are many distinct Koverse users, the Koverse server accesses Accumulo
through a single user (by default, the Accumulo root user). The Koverse server has
the full set of permissions in Accumulo, which includes reading, writing, and modifying
any table or data entry. Controlling access to data stored in Accumulo is completely dependent
upon Koverse’s ability to associate appropriate privileges with the Accumulo requests
it performs on behalf of users.

To authenticate, a Koverse user provides login credentials through the Koverse web interface.
By default, Koverse manages user authentication locally. Koverse compares the user
credentials against the stored login information for that user. If the credentials match, Koverse
creates a session for the user. Koverse can also utilize a third-party authentication
service. In that scenario, when the user submits credentials through the Koverse interface,
Koverse forwards the credentials to the authentication service, which verifies the user identity.
Koverse verifies the response from the authentication service, and if the user provided
appropriate credentials, creates a session for the user.

Once the user has authenticated, it must obtain authorization to perform tasks. For administrative
tasks (such as user and group management, Collection configuration, and audit
log access), Koverse checks the groups associated with the user. If any of the groups has
permission to perform the requested task, the user is granted access.

To access data in a Record, users must have permission to access the Collection associated 
with that Record. During a query, Koverse checks the user’s groups, and then checks if any 
of those groups have access to the Collection. If one of the user’s groups has access to the 
Collection, the user is granted access to the Collection. 
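
The group-based Collection check described above can be sketched as follows. The names CollectionAccessSketch and canAccess are hypothetical, and the sketch assumes group membership and group permissions are available as in-memory maps, whereas Koverse stores them in its JPA system.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CollectionAccessSketch {
    // A user may access a Collection if at least one of the user's groups may.
    static boolean canAccess(Set<String> userGroups,
                             Map<String, Set<String>> collectionsByGroup,
                             String collectionId) {
        for (String group : userGroups) {
            Set<String> allowed = collectionsByGroup.get(group);
            if (allowed != null && allowed.contains(collectionId)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> collectionsByGroup = new HashMap<String, Set<String>>();
        collectionsByGroup.put("analysts", new HashSet<String>(Arrays.asList("tweets", "reports")));
        collectionsByGroup.put("interns", new HashSet<String>(Arrays.asList("tweets")));

        Set<String> groups = new HashSet<String>(Arrays.asList("interns"));
        System.out.println(canAccess(groups, collectionsByGroup, "tweets"));
        System.out.println(canAccess(groups, collectionsByGroup, "reports"));
    }
}
```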

Access to a Collection does not guarantee that the user can access all Records in that Collection.
If any of the requested Records have security labels, the user must provide an
appropriate set of tokens to access those Records. Koverse does not natively manage the
security tokens for each user. To obtain the necessary tokens, the user must authenticate
to a third-party service. Koverse submits the user’s credentials to the third-party service,
which verifies the credentials and returns the appropriate tokens. Koverse stores the tokens
for the duration of the user’s session and uses them for any queries submitted by that user.

6.4 Queries 

Users access data by submitting queries to the Koverse server. The Koverse web interface 
has built-in search functionality that allows users to query data in a way that resembles an
Internet search engine. Users can search for a term in any Field, specify a value for a Field, 
or search for a range of values in a particular Field. Koverse translates queries from the 
search application into a JavaScript Object Notation (JSON) [40] formatted list of Field 
names and values. Example search application queries with their respective JSON queries 
are shown in Figures 6.2 and 6.3 [41]. 
mary
"mary had a"
name:mary
name:mary occupation:shepherd
height:[60 TO 70]

Figure 6.2: Koverse Search application query examples, from [41]. 

{"$any":"mary"}
{"$any":"mary had a"}
{"name":"mary"}
{"$and": [{"name":"mary"}, {"occupation":"shepherd"}]}
{"$and": [{"height":{"$gte":"60"}}, {"height":{"$lte":"70"}}]}

Figure 6.3: Koverse JSON query examples, after [41]. 

After parsing user queries and verifying their syntax, Koverse generates an internal
representation of the query that resembles a SQL “SELECT” statement. The Koverse query is
stored in a Java SelectStatement object and has the format:


SELECT(FieldNames,CollectionIDs,Expression,Offset,Limit)


The CollectionIDs and FieldNames restrict the search to specific Collections and specific
Fields. The Expression is a restriction on the Field values and mirrors the submitted query.
Offset and Limit allow the user to control the range of results that are returned. An Offset
of n ignores the first n results, and a Limit of m returns a maximum of m results.
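
The Offset and Limit semantics can be illustrated with a minimal sketch. The Python class below is a hypothetical stand-in for the Java SelectStatement object; its attribute names and the apply_window helper are assumptions made for illustration and are not taken from Koverse. It only shows how an Offset of n and a Limit of m select a window of results.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SelectStatement:
    """Hypothetical mirror of SELECT(FieldNames,CollectionIDs,Expression,Offset,Limit)."""
    field_names: List[str]
    collection_ids: List[str]
    expression: Dict[str, Any]
    offset: int = 0               # number of leading results to skip
    limit: Optional[int] = None   # maximum number of results; None means no cap


def apply_window(results: List[Any], stmt: SelectStatement) -> List[Any]:
    """Skip the first `offset` results, then return at most `limit` of the rest."""
    end = None if stmt.limit is None else stmt.offset + stmt.limit
    return results[stmt.offset:end]


# Illustrative values only: ten matching results, skip 2, return at most 3.
stmt = SelectStatement(
    field_names=["name"],
    collection_ids=["employees"],
    expression={"name": "mary"},
    offset=2,
    limit=3,
)
page = apply_window(list(range(10)), stmt)
print(page)
```

An Offset larger than the number of results simply yields an empty list in this sketch; how Koverse itself handles that case is not described in the text.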

Once the query has been translated to a SelectStatement, it is executed in two stages. First,
an Accumulo Scanner is created to scan the index table to quickly locate the required
Records. Koverse uses the results of the index scan to create another Accumulo Scanner
for the record table. The range of the record table Scanner is set based on the results of the
index table scan.
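
The two-stage execution can be pictured with a small in-memory sketch. The table layouts, identifiers, and function names below are assumptions for illustration; the actual Koverse index and record table schemas are not reproduced here, and plain dictionaries stand in for Accumulo Scanners.

```python
from typing import Dict, List, Tuple

# Hypothetical index table: (field, value) -> identifiers of matching Records.
INDEX_TABLE: Dict[Tuple[str, str], List[str]] = {
    ("name", "mary"): ["rec-001", "rec-007"],
    ("occupation", "shepherd"): ["rec-007"],
}

# Hypothetical record table: Record identifier -> Field values.
RECORD_TABLE: Dict[str, Dict[str, str]] = {
    "rec-001": {"name": "mary", "occupation": "teacher"},
    "rec-007": {"name": "mary", "occupation": "shepherd"},
    "rec-009": {"name": "peter", "occupation": "baker"},
}


def scan_index(field: str, value: str) -> List[str]:
    """Stage 1: look up which Records contain the requested Field value."""
    return sorted(INDEX_TABLE.get((field, value), []))


def scan_records(record_ids: List[str]) -> List[Dict[str, str]]:
    """Stage 2: fetch only the Records located by the index scan."""
    return [RECORD_TABLE[rid] for rid in record_ids if rid in RECORD_TABLE]


def run_query(field: str, value: str) -> List[Dict[str, str]]:
    """Chain the two stages: the index result bounds the record scan."""
    return scan_records(scan_index(field, value))


matches = run_query("occupation", "shepherd")
print(matches)
```

The point of the design is that the second scan never touches Records the index did not name, which is what setting the record table Scanner's range from the index results achieves.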
6.5 Tokens

Koverse has the ability to apply security labels at the Record level. When data is ingested
in Koverse, the security label is applied as an Accumulo ColumnVisibility object (see
Table 6.1). Although each Field in a Koverse Record is stored in a separate Accumulo
column, Koverse maintains a Record as a single entity, and all Accumulo entries from the
same Record are assigned the same ColumnVisibility. User queries for restricted Records
must include a set of access tokens.

By default, security labels are not associated with any Koverse Records. If Record labels
are desired, Koverse does not provide native support for user token management: this
functionality requires interaction with a third-party application. When the user authenticates,
the third-party application provides the proper set of tokens for that user. Those tokens
are stored for the duration of the user’s Koverse session. Koverse uses these tokens in any
query the user makes and constructs the appropriate Authorizations object for the
Accumulo Scanner.
CHAPTER 7:

Information Security Discussion 


Accumulo cell-level access control assists application developers with data access policy 
enforcement; however, it does not provide a complete information security solution. When 
describing Accumulo’s security capabilities in a PC World interview, Accumulo developer 
Adam Fuchs noted, “[s]ince the applications in this model can push down the security 
model into the database and companion components, you don’t have to solve that in the 
application” [42]. This statement, and similar ones from others in the Accumulo development
community, may give developers a false sense of confidence in the level of security
Accumulo can provide. Production applications must implement sound policy enforcement 
logic to integrate securely with Accumulo. In this chapter, we present potential security 
problems Accumulo client applications should consider. We do not provide an exhaustive 
list of all potential security concerns, but these examples should convince an application 
developer that information security is a significant problem that is not solved exclusively 
using native Accumulo functionality. 

7.1 User and Privilege Management 

Proper management of user accounts and their associated privileges is critical for the
security of any multi-user application. Functionality exists to manage users and privileges
within Accumulo, but these interfaces are not likely to be used to manage client application
users. As previously described, it is not expected that a large-scale Accumulo-based
application will register a user account in Accumulo for every application user. Instead,
Accumulo holds one user for the client application, which manages its own users separately.
For clarity, in further discussion we refer to client application users as appusers and
the Accumulo user as acmuser. Because many appusers are mapped onto one acmuser,
there is no ability to differentiate between appusers at the Accumulo level. The client
application authenticates to Accumulo as the acmuser, but must authenticate appusers and
assign appropriate privileges prior to making any Accumulo requests.

There are several types of privileges to consider in an Accumulo application. System
permissions give users the capability to perform administrative actions, such as creating or
deleting user accounts and granting privileges to users. Table permissions allow users to
modify table entries or table metadata. Cell-level authorizations control access to individual
table entries. Each type of privilege may be present in Accumulo, managed separately
by the client application, or both.

Because there are many appusers that access Accumulo through one acmuser, the acmuser
must hold all privileges necessary to perform Accumulo operations on behalf of any appuser.
It becomes the responsibility of the client application to prevent appusers from
using inappropriate acmuser privileges. The fact that privileges at the application level do
not necessarily map directly to privileges in Accumulo adds complexity to the problem.
For instance, a complex data model at the application level may require a set of privileges
that does not translate to Accumulo. In Koverse, data structures called Collections seem to
map closely to Accumulo Tables. It may seem logical then for any privileges associated
with a Koverse Collection to map to an Accumulo Table. Closer examination reveals that
Koverse Collections do not directly parallel Accumulo Tables. In fact, there are two Accumulo
Tables used to store data regardless of the number of Collections created in Koverse.
Any privileges in Koverse associated with Collections management have no direct meaning
in Accumulo.

Cell-level Authorizations map more closely from the application to Accumulo, but even
these privileges may not translate directly. In the Koverse data model, security labels are
applied at the Record level. When the Record is inserted into Accumulo, the Record is
split into many entries and the security label is modified prior to being applied to all entries
from the Record. To manage Record-level access control, Koverse utilizes a third-party
service that provides a set of access tokens for each user. In the application, the appuser
accesses a Record using these tokens, but at the Accumulo level, the acmuser accesses
multiple entries with a set of Authorizations that are distinct from the tokens provided by
the third-party service.

The following example illustrates a user management scenario and highlights potential
complications. Consider a fictional enterprise human resource information application,
HRapp, illustrated in Figure 7.1. HRapp stores employee information in two Accumulo
Tables (EmployeeInfo and EmployeeSalary) to isolate sensitive salary information from
more general personal information. The Tables are shown in a relational table format for
illustrative purposes. To understand the mapping to Accumulo entries, consider the entry
for Peter’s age in the EmployeeInfo Table. The entry in Accumulo would have the
following structure: Row='1' ColumnFamily='' ColumnQualifier='Age'
ColumnVisibility='SalesDiv' Value='34'. HRapp logic manages user access to each Table.


HRapp user information

User     Table access
Peter    EmployeeInfo
Joanna   EmployeeInfo
Anne     EmployeeInfo, EmployeeSalary
Milton   EmployeeInfo, EmployeeSalary
Bill     EmployeeInfo, EmployeeSalary

EmployeeInfo table

Row  Name               Age             Email
1    Peter [SalesDiv]   34 [SalesDiv]   peter@corp.net [SalesDiv]
2    Joanna [EngDiv]    38 [EngDiv]     joanna@corp.net [EngDiv]
3    Anne [SalesDiv]    31 [SalesDiv]   anne@corp.net [SalesDiv]
4    Bill [Execs]       48 [Execs]      bill@corp.net [Execs]
5    Milton [EngDiv]    33 [EngDiv]     milton@corp.net [EngDiv]

EmployeeSalary table

Row  Name               Salary
1    Milton [EngDiv]    62000 [EngDiv]
2    Joanna [EngDiv]    63000 [EngDiv]
3    Anne [SalesDiv]    65000 [SalesDiv]
4    Bill [Execs]       64000 [Execs]
5    Peter [SalesDiv]   47000 [SalesDiv]

IDapp user information

User     Tokens
Peter    SalesDiv
Joanna   EngDiv
Anne     SalesDiv
Milton   EngDiv
Bill     SalesDiv, EngDiv

Accumulo user privileges

User      Authorizations     EmployeeInfo         EmployeeSalary
root      root               read, write, drop    read, write, drop
acmuser   SalesDiv, EngDiv   read, write          read, write

(Arrow labels in the original figure: “User authentication and token retrieval” and “Data retrieval”.)

Figure 7.1: HRapp example application. 


HRapp authenticates its users using a third-party service called IDapp that verifies user
credentials and returns an appropriate set of access tokens. HRapp interacts with Accumulo
through a single user, acmuser, which has full access to all data. Accumulo ColumnVisibilities
are applied based on the responsible organizational division: EngDiv for Engineering
employees, SalesDiv for Sales employees, and Execs for Executives. In each division there
is an operational manager and a human resources manager. The operational manager has
access to general employee information for his division, and the human resources manager
has access to both general and salary information for her division. Executives have access
to all employee information.


When users log into HRapp, they provide a set of credentials. HRapp forwards these
credentials to IDapp, which verifies the user identity and returns the user’s access tokens.
HRapp stores tokens for the duration of the user’s session. When a user issues a query,
HRapp creates an Accumulo request containing the appropriate table name and tokens.
HRapp does not allow users to query tables they do not have access to. For instance, if user
Peter, who is an operational manager, queries the EmployeeInfo Table, he would receive
the information in rows 1 and 3. If he queried the EmployeeSalary Table, his request would
be denied. If user Milton, who is a human resources manager, queries the EmployeeInfo
Table, he would receive rows 2 and 5. He would have to issue a separate query to retrieve
information in the EmployeeSalary Table and would receive rows 1 and 2. User Bill, an
executive, can query either Table and would receive all rows.

To enforce the above policy, HRapp must restrict user access to certain tables. The general 
application design —HRapp translates appuser requests into acmuser requests—essentially 
requires that table-level permissions be enforced by HRapp logic. The essential conflict 
stems from a combination of two application characteristics. First, the acmuser must have 
the ability to read all tables. Second, Authorizations do not specify table-level permissions 
in Accumulo. Thus, if appuser Joanna can cause HRapp to query the EmployeeSalary
table, e.g., by misusing an interface, then the previously described access control policy 
will be violated. It is not enough, in this example, for HRapp to properly associate access 
tokens with each appuser via IDapp and rely on Accumulo to enforce all access control 
policy requirements. The ambient authority that allows acmuser to read all tables could 
be abused if HRapp fails to enforce table-level policies properly. In theory, table-level 
enforcement could be pushed to Accumulo, if appuser access tokens were table specific. 
Following our example, correcting this problem would require an additional term added to 
the ColumnVisibilities in the EmployeeSalary Table (e.g., [EngDiv&Salary]) and updating 
the appropriate user tokens in IDapp. This adds an additional layer of administration and 
complexity when adding new application users or new database entries. 
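
The conflict described above can be made concrete with a minimal sketch. Everything here is a simplification built on assumptions: labels are modeled as sets of tokens that must all be held (a pure conjunction, rather than full ColumnVisibility expressions), acmuser_scan stands in for a scan issued as acmuser, and the function names and the Salary token are illustrative rather than part of any real API.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# (table, row, tokens required by the label, value)
Entry = Tuple[str, int, FrozenSet[str], object]

# Table-level rights held in HRapp and tokens returned by IDapp (subset of Figure 7.1).
TABLE_ACCESS: Dict[str, Set[str]] = {
    "Joanna": {"EmployeeInfo"},
    "Milton": {"EmployeeInfo", "EmployeeSalary"},
}
TOKENS: Dict[str, Set[str]] = {"Joanna": {"EngDiv"}, "Milton": {"EngDiv"}}

ENTRIES: List[Entry] = [
    ("EmployeeInfo", 2, frozenset({"EngDiv"}), "Joanna"),
    ("EmployeeSalary", 1, frozenset({"EngDiv"}), 62000),
    ("EmployeeSalary", 2, frozenset({"EngDiv"}), 63000),
]


def acmuser_scan(entries: List[Entry], table: str, auths: Set[str]):
    """Stand-in for a scan issued as acmuser: filters on cell labels only."""
    return [(row, value) for t, row, required, value in entries
            if t == table and required <= auths]


def hrapp_query(user: str, table: str, entries=ENTRIES, tokens=TOKENS,
                check_table: bool = True):
    """HRapp logic: the table-level check exists only in the application."""
    if check_table and table not in TABLE_ACCESS[user]:
        raise PermissionError(f"{user} may not query {table}")
    return acmuser_scan(entries, table, tokens[user])


# With the HRapp check in place, Joanna's salary query is refused.
try:
    hrapp_query("Joanna", "EmployeeSalary")
    denied = False
except PermissionError:
    denied = True

# If the check is bypassed, her EngDiv token satisfies the salary labels.
leaked = hrapp_query("Joanna", "EmployeeSalary", check_table=False)

# Alternative from the text: table-specific labels such as [EngDiv&Salary],
# with the extra token issued only to human resources managers.
LABELLED: List[Entry] = [
    ("EmployeeSalary", 1, frozenset({"EngDiv", "Salary"}), 62000),
    ("EmployeeSalary", 2, frozenset({"EngDiv", "Salary"}), 63000),
]
TOKENS_V2: Dict[str, Set[str]] = {"Joanna": {"EngDiv"}, "Milton": {"EngDiv", "Salary"}}
blocked = hrapp_query("Joanna", "EmployeeSalary", LABELLED, TOKENS_V2, check_table=False)
allowed = hrapp_query("Milton", "EmployeeSalary", LABELLED, TOKENS_V2)
print(denied, leaked, blocked, allowed)
```

In the second configuration the cell labels themselves carry the table-level distinction, so a bypassed application check no longer exposes salary rows, at the cost of the extra token administration noted above.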
7.2 NoSQL Injection

Injection attacks are a common exploitation vector, particularly in web applications. They 
are commonly used to retrieve sensitive or restricted data from application databases and 
have been identified as a significant information security concern [43]. Injection attacks 
typically occur when an application accepts user input insecurely. Attackers can craft input 
in such a way that forces the application server to perform actions that are not meant to be 
available to normal users. Injection attacks can allow attackers to perform any action of 
their choosing on the database, including reading, writing, inserting or deleting arbitrary 
data. 

Injection attacks against SQL databases have been well explored, but similar attacks have
also been reported in the expanding NoSQL database community. The OWASP organization
proposes that NoSQL injection attacks may have more significant impact than SQL
injection because they are executed in a lower-level procedural API [44]. Table 7.1
summarizes potential vulnerabilities for some common NoSQL databases. We leave rigorous
examination of the applicability of these attacks to Accumulo as an open problem, but it is
important to note that divorcing an application from SQL databases does not remove the
potential for injection attacks.
Name           Type       Interface Languages                     Documented Query-Language Attacks

Cassandra      column     CQL, drivers available for Java, C#,    manual construction of query
                          Python                                  strings [45]

MongoDB        document   JavaScript, drivers available for       “$where” attacks [46]-[48]
                          many common languages

Redis          key/value  standard map manipulation commands      redisCommand() attack [49]
                          (e.g., GET, SET), drivers available
                          for many common languages

CouchDB        document   HTTP, JavaScript                        JavaScript injection, file system
                                                                  traversal, XSS [50]-[52]

Tokyo Cabinet  key/value  C, Perl, Ruby, Java, Lua                binary protocol injection
                                                                  vulnerabilities [53]

Table 7.1: Summary of NoSQL stores and documented query language vulnerabilities. 


7.3 Information Security Policy Enforcement 

Terminology used in the Accumulo development community may give a false impression
of Accumulo’s security policy enforcement capability. Descriptions of Accumulo frequently
contain terms and phrases that are typically associated with Mandatory Access
Control (MAC) policies, for example: “mandatory attribute-based access control” [28],
access control through object “labels” [54], multiple “security levels” [29] or “security
classifications” [10] stored together, and “intermingling data sets” [55]. According to the
DOD Trusted Computer System Evaluation Criteria (TCSEC) standard, the only systems
associated with labeling are class B1 and above, where those labels “shall be used as the
basis for mandatory access control” [56]. This statement suggests that the use of data
labeling is highly correlated with mandatory access control policies. The use of MAC
terminology suggests that Accumulo can enforce an information flow control policy; however,
Accumulo’s native functionality cannot enforce such a policy.

MAC is an access control and information flow control policy that uses labels to restrict
access to objects based on a comparison of the subject and object security level. The
TCSEC standard states that in order to enforce a mandatory security policy, these labels
must be applied to each object in the system and must reliably identify the objects’
sensitivity levels [56]. A MAC policy does not dictate access rules for individual subjects,
but relies on labels to enforce access control. This stands in contrast to a Discretionary
Access Control (DAC) policy that maintains a set of object access rights for each subject
and allows subjects to grant and revoke access to other subjects for objects they own [57].

An immediate indication of Accumulo’s inability to enforce information flow policies is
the absence of a lattice-based ordering of Accumulo labels. A key feature of a MAC policy
is a lattice framework constructed by a partially ordered set of security levels [58]. The
lattice is necessary for determining dominance between two different security levels. The
dominance property determines if a subject is authorized to perform an action. Sandhu
(1996) implements a mandatory policy using “role hierarchies” in a lattice framework [59].
In Accumulo, ColumnVisibilities are used to label data, but no mechanism exists for
determining ordering of cell-level labels. Access to Accumulo entries is based on a byte-by-byte
comparison of the boolean ColumnVisibility expression to user Authorizations. A client
application would need to provide additional logic to determine ordering.
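
As an illustration of the kind of ordering logic a client application would have to supply, the sketch below implements a conventional dominance check over security levels made of a classification and a set of categories. The level names, their numeric ranks, and the use of division names as categories are assumptions for the example; none of this exists in Accumulo.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Assumed total order on classifications, lowest first.
LEVELS = {"UNCLASSIFIED": 0, "SECRET": 1, "TOP SECRET": 2}


@dataclass(frozen=True)
class SecurityLevel:
    classification: str
    categories: FrozenSet[str] = frozenset()


def dominates(a: SecurityLevel, b: SecurityLevel) -> bool:
    """a dominates b when a's classification is at least b's and
    a's categories include all of b's (a partial order, not a total one)."""
    return (LEVELS[a.classification] >= LEVELS[b.classification]
            and a.categories >= b.categories)


secret_eng = SecurityLevel("SECRET", frozenset({"EngDiv"}))
unclassified = SecurityLevel("UNCLASSIFIED")
ts_sales = SecurityLevel("TOP SECRET", frozenset({"SalesDiv"}))

print(dominates(secret_eng, unclassified))  # higher level, superset of categories
print(dominates(secret_eng, ts_sales))      # lower level, different category
print(dominates(ts_sales, secret_eng))      # higher level but missing EngDiv
```

The last two calls show a pair of levels where neither dominates the other, which is exactly the partial ordering that a flat set of label tokens cannot express on its own.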

To further illustrate Accumulo’s inability to enforce MAC, we consider the Bell-LaPadula
[60] model, a well understood MAC policy. Bell-LaPadula identifies three properties that
a secure system should exhibit. The simple security property requires that no user can read
data with a higher classification than the user’s security level. This property is commonly
referred to as “no read up.” The star property requires that no user can write data with a
lower classification than the user’s security level. This property is commonly referred to as
“no write down.” The tranquility property requires that no user can modify the classification
level of data [57]. Accumulo’s ColumnVisibilities may be able to enforce the simple
security property. With proper assignment of ColumnVisibilities to data and Authorizations
to users, Accumulo can ensure that users do not access unauthorized data (i.e., data for
which the user does not hold the appropriate set of Authorizations). For instance, a user
with a SECRET token would not be able to access TOP SECRET data; however, because
Accumulo does not impose order on ColumnVisibilities, that user would also not be able to
access UNCLASSIFIED data. To implement this ability, a user with SECRET clearance would
need Authorizations that include both SECRET and UNCLASSIFIED tokens. This may or
may not be a desired property in any particular implementation but illustrates a potential
problem.
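
The SECRET and UNCLASSIFIED example can be sketched as follows. This is a deliberately reduced model, assuming single-term labels that must appear verbatim in the user's token set; the ORDER list and the expand_clearance helper are hypothetical client-side additions, not Accumulo features.

```python
from typing import List, Set

# Assumed ordering known only to the client application, lowest first.
ORDER: List[str] = ["UNCLASSIFIED", "SECRET", "TOP SECRET"]


def visible(label: str, authorizations: Set[str]) -> bool:
    """Simplified check for a single-term label: the token must be held verbatim."""
    return label in authorizations


def expand_clearance(clearance: str) -> Set[str]:
    """Client-side helper: grant the clearance token and every lower-level token."""
    return set(ORDER[: ORDER.index(clearance) + 1])


flat = {"SECRET"}
print(visible("TOP SECRET", flat))      # no read up holds
print(visible("UNCLASSIFIED", flat))    # but lower-level data is hidden too

expanded = expand_clearance("SECRET")
print(sorted(expanded))
print(visible("UNCLASSIFIED", expanded))
```

With the expanded token set the user can read at and below the SECRET level, but the ordering lives entirely in the client application's token assignment.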

Accumulo does not enforce access controls on write operations in the same way as read 
operations. By default, there is no user Authorizations check when writing data. Any 
user can write data with any ColumnVisibility value. This may violate the star property. 
A user can read data with a high security label and write identical data to a cell with a 
lower label. This problem is also referred to as leakage in some literature. If the Row, 
ColumnFamily, and ColumnQualifier of the data are kept the same, this scenario would 
also violate the tranquility property. The old entry would be effectively re-labeled with a 
lower classification. Without restrictions administered by the client application, any user 
with write access can effectively downgrade the classification of any data. 
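
A minimal sketch of the leakage scenario follows, assuming a toy in-memory table, single-term labels, and hypothetical read and write helpers. It models only the default behavior described above (no Authorizations check on write) and makes no claim about how Accumulo stores or versions the resulting entries.

```python
from typing import List, Set, Tuple

# (row, column qualifier, label, value); illustrative data only.
Entry = Tuple[str, str, str, str]
table: List[Entry] = [("1", "Report", "SECRET", "quarterly forecast")]


def write(row: str, qualifier: str, label: str, value: str) -> None:
    """Default behavior modeled here: no check of the writer's Authorizations."""
    table.append((row, qualifier, label, value))


def read(authorizations: Set[str]) -> List[Entry]:
    """Reads are filtered by label, as in the simple security property."""
    return [entry for entry in table if entry[2] in authorizations]


before = read({"UNCLASSIFIED"})

# A user holding SECRET reads an entry and rewrites its value under a lower label.
for row, qualifier, _label, value in read({"SECRET"}):
    write(row, qualifier, "UNCLASSIFIED", value)

after = read({"UNCLASSIFIED"})
print(before, after)
```

The read path behaves correctly throughout; the value becomes visible to the lower-cleared reader only because nothing constrains the label chosen on write.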

Accumulo has an optional configuration setting that can be applied at the table level that
prevents users from writing data with ColumnVisibilities that are not part of their
Authorizations set. Recall that during read operations, Accumulo verifies that the subset of
Authorizations provided in the query satisfies the ColumnVisibility associated with the requested
data. Accumulo does not perform this check during write operations. Instead, Accumulo
simply verifies that the appropriate Authorizations are associated with the user. In the
recommended use case, in which all appusers operate through one acmuser, this check
provides no protection. When the request reaches Accumulo, it is executed by acmuser,
which holds the entire domain of Authorizations necessary for all appusers. Therefore,
any appuser could write data with any ColumnVisibility within the domain. The constraint
would, however, prevent appusers from writing data with nonsensical ColumnVisibilities
in the context of the application.
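
The limitation can be shown with a short sketch. It assumes labels are plain conjunctions modeled as sets of terms and uses a hypothetical constraint_allows function in place of the table-level setting; the token sets follow Figure 7.1.

```python
from typing import Dict, Set

ACMUSER_AUTHS: Set[str] = {"SalesDiv", "EngDiv"}          # held by the Accumulo user
APPUSER_TOKENS: Dict[str, Set[str]] = {"Peter": {"SalesDiv"}}


def constraint_allows(label_terms: Set[str], writer_auths: Set[str]) -> bool:
    """Simplified write-time check: every term in the label must be held
    by the user on whose behalf the check is evaluated."""
    return label_terms <= writer_auths


# Peter's write arrives at Accumulo as acmuser, so an EngDiv label passes
# even though Peter himself holds only SalesDiv.
passes_as_acmuser = constraint_allows({"EngDiv"}, ACMUSER_AUTHS)

# The same check evaluated against Peter's own tokens would reject the write;
# this is the logic the client application has to supply.
passes_as_peter = constraint_allows({"EngDiv"}, APPUSER_TOKENS["Peter"])

# A label outside the application's domain is rejected either way.
nonsense = constraint_allows({"Marketing"}, ACMUSER_AUTHS)
print(passes_as_acmuser, passes_as_peter, nonsense)
```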

Accumulo’s loose restrictions on write operations prevent it from enforcing useful MAC 
properties. If a data access policy allows only read operations, Accumulo could be used 
to enforce the simple security property, but would likely require the client application to 
provide some additional logic to fully implement an ordered lattice framework, especially 
in the scenario in which multiple client application users are mapped to one Accumulo user. 
If client application users routinely write to the database, Accumulo could provide only 
DAC enforcement, and logic needed to enforce a MAC policy would have to be provided 
by the client application. 
CHAPTER 8:

Conclusion and Future Work 


In this thesis, we studied Apache Accumulo’s cell-level access control. This fine-grained
access control can be used in data sets with varying degrees of sensitivity to maximize
accessibility while maintaining the required level of secrecy. This security feature gives
Accumulo a unique position in the quickly expanding NoSQL ecosystem and is particularly
interesting for the DOD, where it is being integrated into projects like the Naval Tactical
Cloud.


8.1 Conclusions 

We employed static analysis of source code to gain detailed insight into Accumulo’s
cell-level access control enforcement. We illustrated the execution path of a query starting
at the client Scanner interface and ending at the enforcement point in the TabletServer.
We formalized the syntax for a ColumnVisibility label and showed how Authorizations are
compared to ColumnVisibility expressions to filter query results. These details provide
more insight into Accumulo’s security policy enforcement mechanisms that can be used
for further study.

After understanding low-level details of Accumulo policy enforcement, we showed how
Accumulo could be integrated into a larger application. We highlighted important interfaces
in the client library needed to perform basic read and write operations. We identified
several examples of applications that use Accumulo and detailed Koverse operation as a
case study. We used Koverse to show how an application could develop a custom data
model and map it to Accumulo. Most importantly, we showed how Accumulo’s recommended
user organization (multiple application users mapped to one Accumulo user) is
implemented in practice. We showed how a custom application query can be translated to
Accumulo queries. Although Koverse does not implement fine-grained security by default,
we showed how that functionality would interact with Accumulo if used. The Koverse case
study gives readers a basic understanding of application integration with Accumulo. Our
work can be interpreted as a first step toward a thorough analysis of Accumulo information
security enforcement. Understanding the interaction between Koverse and Accumulo is
particularly useful for readers who are concerned with how Accumulo may benefit security
of sensitive DOD information.

We commented on potential security threats facing developers that build applications based
on Accumulo. We used a hypothetical application to illustrate potential user management
concerns. We identified injection attacks that have been carried out against other NoSQL
databases and may be relevant to some uses of Accumulo. We commented on Accumulo’s
inability to enforce information flow policies. These examples serve to demonstrate that
using Accumulo and its cell-level security feature is not a full solution to access control
problems unless Accumulo is paired with well-designed enforcement mechanisms in
the client application. We believe that the combination of our technical discussion of
Accumulo’s cell-level access control enforcement, illustration of Accumulo integration in a
larger application, and identification of potential security concerns may help future studies
learn more about Accumulo information security and lead to development of more secure
Accumulo-based applications.

8.2 Future Work 

The scope of this thesis was limited primarily to static analysis of Accumulo source code. 
We were able to provide a detailed description of Accumulo’s security policy enforcement 
using this method, but there are other methods that could be used to further investigate 
information security in Accumulo. Potential areas for future research include: 

Application vulnerability analysis 

More detailed analysis could be done to determine if specific instantiations and
configurations of Accumulo have any vulnerabilities that may lead to a security
compromise. For instance, in Chapter 2 we list several known injection attacks against
NoSQL databases, and follow-on studies could determine if these are applicable to
Accumulo. A starting point for such studies could be an open source JSON interface
for Accumulo called Jaccson [61]. According to its documentation, Jaccson’s design
is based on MongoDB’s API, and therefore, may be susceptible to attacks similar to
“$where” attacks used against MongoDB.

In addition to analysis of known attacks, future research could attempt to identify
Accumulo-specific vulnerabilities using penetration testing tools such as OWASP
Zed Attack Proxy or fuzzing tools. Many of these tools are protocol specific, so
efforts could be made to adapt the general approach of a specific tool to testing of
Accumulo or Accumulo-based applications. Web applications are the most frequent
targets for injection attacks, and both Accumulo and HDFS supply web interfaces to
monitor system performance. Koverse also provides a web interface and is a component
of NTC. As Accumulo becomes more popular, there may be more large-scale
applications available for testing.

Network traffic analysis 

Accumulo components reside on disjoint physical machines and must communicate
across a network. Current versions of Accumulo communicate largely through remote
procedure calls over TCP/IP via Apache Thrift’s network stack [62]. If these
communications are insecure, they could leak sensitive information. Future studies
could analyze all network traffic generated by Accumulo components, determine
what information is transmitted, and identify default communication security settings.
Based on this traffic analysis, researchers could determine what information
may be at risk and recommend vulnerability mitigation strategies.

Best practice configuration settings 

The NoSQL ecosystem is relatively new and availability of security best practices
is limited. Future work could include a survey of NoSQL databases to determine
configuration properties that are security relevant. It may be possible to develop a
general set of security-related best practices for NoSQL systems, but the wide range
of systems that fall under the NoSQL umbrella may require generalization to the point
of triviality. In any case, the development of a set of best practices specific to
Accumulo should be feasible.

Information flow control

We showed that Accumulo is not capable of enforcing information flow control policies
without additional logic. Further research could propose how to achieve mandatory
access control policy enforcement in an Accumulo application. One promising
area of research is using the NSA’s Cloud Security Gateway and Trusted Data Format
to implement an integrity lock [63] style architecture. Another method could modify
Accumulo to rely on a trusted operating system to enforce information flow policy
following approaches explored by Nguyen et al. [64] and Roy et al. [65]. A successful
study could validate the use of Accumulo in cross-domain DOD applications.
APPENDIX: Accumulo Installation


This guide covers the installation/configuration of Hadoop, Zookeeper and Accumulo. 
These instructions were tested using a fresh install of Ubuntu-12.04 LTS (64-bit). 

Install Hadoop 1.2.1 

This guide will install and configure a single node pseudo-distributed version of Hadoop. 

1. Install Java

$ sudo apt-get install openjdk-6-jdk
$ java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) \
(6b27-1.12.6-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

2. Disable ipv6 (recommended by many Hadoop users) 

$ sudo vi /etc/sysctl.conf 

Add the following lines to the end: 

#disable ipv6 

net.ipv6.conf.all.disable_ipv6 = 1 
net.ipv6.conf.default.disable_ipv6 = 1 
net.ipv6.conf.lo.disable_ipv6 = 1 

3. Download Hadoop 1.2.1 from one of the Apache mirrors¹ and unpack it.

$ wget http://goo.gl/0oR9TS -O hadoop-1.2.1.tar.gz
$ tar xzf hadoop-1.2.1.tar.gz

4. Define JAVA_HOME as the root of your Java installation.

$ vi hadoop-1.2.1/conf/hadoop-env.sh

Adjust the following line:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

¹See http://hadoop.apache.org/releases.html

5. Configure Hadoop. Edit hadoop-1.2.1/conf/core-site.xml to reflect:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Edit hadoop-1.2.1/conf/hdfs-site.xml to reflect:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
</configuration>

Edit hadoop-1.2.1/conf/mapred-site.xml to reflect:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

6. Configure ssh to be passwordless. Test to see if a password is required, using the 
command: 

$ ssh localhost 
If you can’t ssh into localhost without a password, execute the following:


$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


Install Zookeeper 3.4.5 

This guide will install and configure Zookeeper for standalone operation. 

1. Download Zookeeper from one of the Apache mirrors², and unpack it. 

$ wget http://goo.gl/lFQoec -O zookeeper-3.4.5.tar.gz 
$ tar xzvf zookeeper-3.4.5.tar.gz 

2. Create the configuration file zookeeper-3.4.5/conf/zoo.cfg: 

tickTime=2000 

dataDir=/var/lib/zookeeper 

clientPort=2181 

maxClientCnxns=100 

The dataDir should point to an existing empty directory: 

$ sudo mkdir /var/lib/zookeeper 

$ sudo chown `whoami` /var/lib/zookeeper 
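
The Zookeeper configuration above can be sketched as a short script that writes zoo.cfg and checks the data directory. ZK_CONF and ZK_DATA are illustrative variables, not part of Zookeeper; both default to scratch directories here. For the installation described in this guide they would be zookeeper-3.4.5/conf and /var/lib/zookeeper, the latter created with sudo as shown above.

```shell
#!/bin/sh
# Sketch: generate zoo.cfg and check that dataDir exists and is empty.
# ZK_CONF and ZK_DATA are assumptions; both default to scratch locations.
ZK_CONF="${ZK_CONF:-$(mktemp -d)}"
ZK_DATA="${ZK_DATA:-$(mktemp -d)}"

# Same four settings as in the guide, with dataDir taken from ZK_DATA.
cat > "$ZK_CONF/zoo.cfg" <<EOF
tickTime=2000
dataDir=$ZK_DATA
clientPort=2181
maxClientCnxns=100
EOF

mkdir -p "$ZK_DATA"

# The guide asks for an empty directory; warn if it already holds files.
if [ -n "$(ls -A "$ZK_DATA")" ]; then
    echo "warning: $ZK_DATA is not empty" >&2
fi
```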


Install Accumulo 1.5.0 

This guide will install and configure Accumulo for a single computer. 

1. Download Accumulo source from one of the Apache mirrors³, and unpack it. 

$ wget http://goo.gl/inG73aD -O accumulo-1.5.0-src.tar.gz 
$ tar xzvf accumulo-1.5.0-src.tar.gz 

2. Build Accumulo. 

$ sudo apt-get install maven 
$ cd accumulo-1.5.0 

$ mvn package -P assemble 
$ cd .. 

²See http://zookeeper.apache.org/releases.html 
³See http://accumulo.apache.org/downloads/ 


3. Copy configuration files to conf directory. 

$ cp accumulo-1.5.0/conf/examples/512MB/native-standalone/* \ 
accumulo-1.5.0/conf 

4. Set JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME: 

$ vi accumulo-1.5.0/conf/accumulo-env.sh 

In particular, any lines featuring these exports should read: 

export ZOOKEEPER_HOME=<your path>/zookeeper-3.4.5
export HADOOP_PREFIX=<your path>/hadoop-1.2.1
export JAVA_HOME=<your path, same as above for Hadoop>
test -z "$ACCUMULO_HOME" && \
export ACCUMULO_HOME=<your path>/accumulo-1.5.0

5. Create the directory indicated by the path variable ACCUMULO_LOG_DIR. This path is 
defined in the configuration script accumulo-1.5.0/conf/accumulo-env.sh. For 
example: 

$ mkdir accumulo-1.5.0/logs 

6. Accumulo requires the Hadoop “commons-io” Java package. This is normally distributed 
with Hadoop. It should be located at hadoop-1.2.1/lib/commons-io-2.1.jar. 
If your Hadoop distribution does not provide this package, you will need to obtain it 
and put the “commons-io” jar file under accumulo-1.5.0/lib. 

Starting Accumulo 

Use the following steps to start the Accumulo instance and verify the installation. 

1. Start Hadoop 

$ hadoop-1.2.1/bin/hadoop namenode -format 
$ hadoop-1.2.1/bin/start-all.sh 

2. Verify Hadoop is running by browsing the following web interfaces. If you can 
connect to these pages, Hadoop is running: 

$ lynx http://localhost:50070/ 

$ lynx http://localhost:50030/ 

3. Start Zookeeper 

$ zookeeper-3.4.5/bin/zkServer.sh start 

4. Verify Zookeeper is running by connecting to the shell: 

$ zookeeper-3.4.5/bin/zkCli.sh -server 127.0.0.1:2181 

You should see a command prompt, like: 

[zk: 127.0.0.1:2181(CONNECTED) 0] 

To exit the shell, type ‘quit’. 

5. Initialize Accumulo, to create an instance name and root password. 

$ accumulo-1.5.0/bin/accumulo init 

6. Start Accumulo 

$ accumulo-1.5.0/bin/start-all.sh 

7. Verify Accumulo is running by browsing: 

$ lynx http://localhost:50095/ 

Alternatively, verify Accumulo is running by connecting to the shell: 

$ accumulo-1.5.0/bin/accumulo shell -u root 

Enter the root password you just created. You should see the prompt 
root@accumulo> 

Exit the shell by typing ‘quit’. 
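
The services may take a few seconds to come up, so a verification step run immediately after a start command can fail even when the installation is correct. The following is a minimal sketch of a retry wrapper for such checks. The helper name retry and the RETRY_DELAY variable are illustrative assumptions; neither is part of Hadoop, Zookeeper, or Accumulo.

```shell
#!/bin/sh
# Sketch: repeat a check until it succeeds or the attempts run out.
# retry and RETRY_DELAY are hypothetical names introduced for this example.

retry() {
    # $1 = maximum number of attempts; remaining arguments = command to run
    attempts="$1"; shift
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1            # give up: every attempt failed
        fi
        n=$((n + 1))
        sleep "${RETRY_DELAY:-2}"   # seconds to wait between attempts
    done
    return 0
}
```

Assuming a netcat build that supports the -z option is installed, a call such as `retry 30 nc -z localhost 2181` would wait for the Zookeeper client port. The same pattern applies to the web ports used above (50070, 50030, and 50095).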
Stopping Accumulo 

The following commands can be used to stop the running Accumulo instance: 


$ accumulo-1.5.0/bin/stop-all.sh 
$ zookeeper-3.4.5/bin/zkServer.sh stop 
$ hadoop-1.2.1/bin/stop-all.sh 
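
The stop commands run in the reverse order of the start commands. As a sketch, both sequences can be kept in one wrapper so that the ordering is not retyped each time. BASE (the directory holding the three unpacked trees), DRY_RUN, and the function names run and stack are assumptions introduced here. With DRY_RUN=1, the default in this sketch, the commands are only printed. The one-time steps (namenode -format and accumulo init) are deliberately left out.

```shell
#!/bin/sh
# Sketch: one wrapper for the start and stop sequences in this guide.
# BASE and DRY_RUN are assumptions; DRY_RUN=1 prints instead of executing.
BASE="${BASE:-.}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

stack() {
    case "$1" in
        start)
            run "$BASE/hadoop-1.2.1/bin/start-all.sh"
            run "$BASE/zookeeper-3.4.5/bin/zkServer.sh" start
            run "$BASE/accumulo-1.5.0/bin/start-all.sh"
            ;;
        stop)
            # Stop in the reverse order of startup.
            run "$BASE/accumulo-1.5.0/bin/stop-all.sh"
            run "$BASE/zookeeper-3.4.5/bin/zkServer.sh" stop
            run "$BASE/hadoop-1.2.1/bin/stop-all.sh"
            ;;
        *)
            echo "usage: stack start|stop" >&2
            return 2
            ;;
    esac
}

# Demonstration: show what a stop would run.
stack stop
```

Setting DRY_RUN=0 makes the wrapper execute the scripts instead of printing them.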